How to remove double quotes (") and newlines appearing between ," and ", in a unix file - unix

I am getting a comma-delimited file in which string and date fields are enclosed in double quotes. We are getting stray " characters and newline feeds inside the string columns, like below:
"1234","asdf","with"doublequotes","new line
feed","withmultiple""doublequotes"
I want output like:
"1234","asdf","withdoublequotes","new linefeed","withmultipledoublequotes"
I have tried
sed 's/\([^",]\)"\([^",]\)/\1\2/g;s/\([^",]\)""/\1"/g;s/""\([^",]\)/"\1/g' < infile > outfile
It removes the double quotes inside strings, but it also strips the final double quote, like below:
"1234","asdf","withdoublequotes","new line
feed","withmultiple"doublequotes
Is there a way to remove " and newline feeds that come between ", and ,"?

Your substitutions for two consecutive quotes didn't work because they run after the substitution for a lone quote, by which point only one of the two quotes is left.
We can remove " by repeated substitutions (otherwise a quote re-created by an earlier substitution would survive), and remove the newline feed by appending the next input line whenever the current one does not end in a quote:
sed ':1;/[^"]$/{;N;s/\n//;b1;};:0;s/\([^,]\)"\([^,]\)/\1\2/g;t0' <infile >outfile
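As a quick sanity check, the command can be run against the sample data from the question (this sketch assumes GNU sed; infile and outfile are just placeholder names):

```shell
# Recreate the sample input: two physical lines forming one logical
# record with an embedded newline and stray double quotes.
printf '%s\n' \
  '"1234","asdf","with"doublequotes","new line' \
  'feed","withmultiple""doublequotes"' > infile

# Join continuation lines while the buffer does not end in a quote,
# then repeatedly strip quotes not adjacent to a comma until no
# substitution fires.
sed ':1;/[^"]$/{;N;s/\n//;b1;};:0;s/\([^,]\)"\([^,]\)/\1\2/g;t0' <infile >outfile

cat outfile
# "1234","asdf","withdoublequotes","new linefeed","withmultipledoublequotes"
```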

Related

How to remove the new line when reading from UNIX process groovy? [duplicate]

I have a string that contains some text followed by a blank line. What's the best way to keep the part with text, but remove the whitespace newline from the end?
Use the String.trim() method to get rid of whitespace (spaces, newlines, etc.) from the beginning and end of the string.
String trimmedString = myString.trim();
String.replaceAll("[\n\r]", "");
This Java code does exactly what is asked in the title of the question, that is "remove newlines from beginning and end of a string-java":
String.replaceAll("^[\n\r]", "").replaceAll("[\n\r]$", "")
Remove newlines only from the end of the line:
String.replaceAll("[\n\r]$", "")
Remove newlines only from the beginning of the line:
String.replaceAll("^[\n\r]", "")
tl;dr
String cleanString = dirtyString.strip() ; // Call new `String::strip` method.
String::strip…
The old String::trim method has a strange definition of whitespace.
As discussed here, Java 11 adds new strip… methods to the String class. These use a more Unicode-savvy definition of whitespace. See the rules of this definition in the class JavaDoc for Character::isWhitespace.
Example code.
String input = " some Thing ";
System.out.println("before->>"+input+"<<-");
input = input.strip();
System.out.println("after->>"+input+"<<-");
Or you can strip just the leading or just the trailing whitespace.
You do not mention exactly what code point(s) make up your newlines. I imagine your newline is likely included in this list of code points targeted by strip:
It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
It is '\t', U+0009 HORIZONTAL TABULATION.
It is '\n', U+000A LINE FEED.
It is '\u000B', U+000B VERTICAL TABULATION.
It is '\f', U+000C FORM FEED.
It is '\r', U+000D CARRIAGE RETURN.
It is '\u001C', U+001C FILE SEPARATOR.
It is '\u001D', U+001D GROUP SEPARATOR.
It is '\u001E', U+001E RECORD SEPARATOR.
It is '\u001F', U+001F UNIT SEPARATOR.
If your string is potentially null, consider using StringUtils.trim() - the null-safe version of String.trim().
If you only want to remove line breaks (not spaces or tabs) at the beginning and end of a String (not in between), then you can use this approach:
Use a regular expressions to remove carriage returns (\\r) and line feeds (\\n) from the beginning (^) and ending ($) of a string:
s = s.replaceAll("(^[\\r\\n]+|[\\r\\n]+$)", "")
Complete Example:
public class RemoveLineBreaks {
public static void main(String[] args) {
var s = "\nHello world\nHello everyone\n";
System.out.println("before: >"+s+"<");
s = s.replaceAll("(^[\\r\\n]+|[\\r\\n]+$)", "");
System.out.println("after: >"+s+"<");
}
}
It outputs:
before: >
Hello world
Hello everyone
<
after: >Hello world
Hello everyone<
I'm going to add an answer to this as well because, while I had the same question, the provided answer did not suffice. Given some thought, I realized that this can be done very easily with a regular expression.
To remove newlines from the beginning:
// Trim left
String[] a = "\n\nfrom the beginning\n\n".split("^\\n+", 2);
System.out.println("-" + (a.length > 1 ? a[1] : a[0]) + "-");
and end of a string:
// Trim right
String z = "\n\nfrom the end\n\n";
System.out.println("-" + z.split("\\n+$", 2)[0] + "-");
I'm certain that this is not the most efficient way of trimming a string. But it does appear to be the cleanest and simplest way to inline such an operation.
Note that the same method can be done to trim any variation and combination of characters from either end as it's a simple regex.
Try this
function replaceNewLine(str) {
return str.replace(/[\n\r]/g, "");
}
String trimStartEnd = "\n TestString1 linebreak1\nlinebreak2\nlinebreak3\n TestString2 \n";
System.out.println("Original String : [" + trimStartEnd + "]");
System.out.println("-----------------------------");
System.out.println("Result String : [" + trimStartEnd.replaceAll("^(\\r\\n|[\\n\\x0B\\x0C\\r\\u0085\\u2028\\u2029])|(\\r\\n|[\\n\\x0B\\x0C\\r\\u0085\\u2028\\u2029])$", "") + "]");
Start of a string = ^
End of a string = $
Regex alternation = |
Linebreak = \r\n|[\n\x0B\x0C\r\u0085\u2028\u2029]
Another elegant solution.
String myString = "\nLogbasex\n";
myString = org.apache.commons.lang3.StringUtils.strip(myString, "\n");
For anyone else looking for an answer to this question when dealing with different linebreaks:
string.replaceAll("(\n|\r|\r\n)$", ""); // Java 7
string.replaceAll("\\R$", ""); // Java 8
This should remove exactly the last line break, preserve all other whitespace in the string, and work with Unix (\n), Windows (\r\n) and old Mac (\r) line breaks: https://stackoverflow.com/a/20056634, https://stackoverflow.com/a/49791415. "\\R" is a linebreak matcher introduced in Java 8 in the Pattern class: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
This passes these tests:
// Windows:
value = "\r\n test \r\n value \r\n";
assertEquals("\r\n test \r\n value ", value.replaceAll("\\R$", ""));
// Unix:
value = "\n test \n value \n";
assertEquals("\n test \n value ", value.replaceAll("\\R$", ""));
// Old Mac:
value = "\r test \r value \r";
assertEquals("\r test \r value ", value.replaceAll("\\R$", ""));
String text = readFileAsString("textfile.txt");
text = text.replace("\n", "").replace("\r", "");

How to remove line breaks in particular column while exporting data in db2

I am trying to export data from a DB2 database to a text file. Each column is surrounded by "double quotes" and separated by a semicolon; however, there is one column that contains line breaks. Is there any way to remove the line breaks so the column is exported as a single line?
Example
test.txt:
"123","qweeerr","qqqqqq
rrrrr
hhhhhh","sdfsfs"
I need output like below in test.txt:
"123","qweeerr","qqqqqq rrrrr hhhhhh","sdfsfs"
You can do this:
mayankp#mayank:~/$ tr '\n' ' ' < test.txt
"123","qweeerr","qqqqqq rrrrr hhhhhh","sdfsfs"
Changing all x'0D' and x'0A' characters to a space.
If you want to remove them, specify '' as the last parameter instead of ' '.
select translate(s, '', x'0d0a', ' ')
from table(values 'a'||x'0d'||x'0a'||'b', 'a'||x'0a'||'b') t(s);
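Outside the database, the same x'0D'/x'0A' cleanup can be sketched with plain tr (a POSIX utility; the file names here are placeholders, not from the question):

```shell
# Sample field content containing CR/LF and a bare LF, mirroring the
# values built in the SQL example above.
printf 'a\r\nb' > crlf.txt
printf 'a\nb'   > lf.txt

tr '\r\n' '  ' < crlf.txt   # map both CR and LF to spaces -> "a  b"
tr -d '\r\n'   < crlf.txt   # delete them outright         -> "ab"
tr -d '\n'     < lf.txt     # LF-only input                -> "ab"
```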
$ sed -z 's/\n"/"/g' test.txt

combining strings to one string in r

I'm trying to combine some strings into one. In the end, this string should be generated:
//*[#id="coll276"]
So the inner part of the string is a vector: tag <- 'coll276'
I already used the paste() method like this:
paste('//*[#id="',tag,'"]', sep = "")
But my result looks like the following: //*[#id=\"coll276\"]
I don't know why R is putting some \ into my string. How can I fix this problem?
Thanks a lot!
tl;dr: Don't worry about them, they're not really there. It's just something added by print.
Those \ are escape characters that tell R to ignore the special properties of the characters that follow them. Look at the output of your paste function:
paste('//*[#id="',tag,'"]', sep = "")
[1] "//*[#id=\"coll276\"]"
You'll see that the output, since it is a string, is enclosed in double quotes "". Normally, the double quotes inside your string would break the string up into two strings with bare code in the middle:
"//*[#id\" coll276 "]"
To prevent this, R "escapes" the quotes in your string so they don't do this. This is just a visual effect. If you write your string to a file, you'll see that those escaping \ aren't actually there:
write(paste('//*[#id="',tag,'"]', sep = ""), 'out.txt')
This is what is in the file:
//*[#id="coll276"]
You can use cat to print the exact value of the string to the console (thanks @LukeC):
cat(paste('//*[#id="',tag,'"]', sep = ""))
//*[#id="coll276"]
Or use single quotes (if possible):
paste('//*[#id=\'',tag,'\']', sep = "")
[1] "//*[#id='coll276']"

How can I delete every single letter of a row after a certain character in R?

I am having trouble cleaning transactions. I have an Excel file with every single transaction that clients make, with the number, the gloss, and the industry code. I convert this Excel file into text separated by ";"; then I only need to clean the gloss and convert it back into Excel.
tolower(tabla1)
lapply(tabla1, tolower)
tabla1[] <- lapply(tabla1, tolower)
str(tabla1)
tabla1
tabla1_texto <- gsub("[.]", "", tabla1)
table1_texto <- gsub("[(]", " ", tabla1_texto)
I know that I need to use gsub(), but I'm not sure how. On another note, does someone know how to build a correct dictionary so as to only keep certain words and delete every other word?
If you have a string like this one:
string <- "Some text here; and some text here; and some more text here"
Then you can delete everything after the first ; with:
gsub(";.*$", "", string)
[1] "Some text here"
Explanation of ;.*$, which you will be substituting with "" (empty string):
starting with ;
any character . zero or more times *
up until the end of the line $
If you have a table, you will have to do this for every row separately.

Progress / 4GL: Export table as .csv with column names and double quotes?

I'm trying to write a .p script that will export a table from a database as a csv. The following code creates the csv:
OUTPUT TO VALUE ("C:\Users\Admin\Desktop\test.csv").
FOR EACH table-name NO-LOCK:
EXPORT DELIMITER "," table-name.
END.
OUTPUT CLOSE.
QUIT.
However, I can't figure out how to encapsulate all of the fields in double quotes, nor can I figure out how to get the first row of the .csv to contain the column names of the table. How would one go about doing this?
I'm very new to Progress / 4GL. Originally I was using R and an ODBC connection to import and format the table before saving it as a csv. But I've learned that the ODBC driver I'm using does not work reliably...sometimes it will not return all the rows in the table.
The ultimate goal is to pass an argument (table-name) to a .p script that will export the table as a csv. Then I can import the csv in R, manipulate / format the data and then export the table again as a csv.
Any advice would be greatly appreciated.
EDIT:
The version of Progress I am using is 9.1D
Using the above code, the output might look like this...
"ACME",01,"Some note that may contain carriage returns.\n More text",yes,"01A"
The reason for trying to encapsulate every field in double quotes is that some fields may contain carriage returns or other special characters. R doesn't always like a carriage return in the middle of a field. So the desired output would be...
"ACME","01","Some note that may contain carriage returns.\n More text","yes","01A"
Progress version is important to know. Your ODBC issue is likely caused by the fact that formats in Progress are default display formats and don't actually limit the amount of data to be stored. Which of course drives SQL mad.
You can use this KB to learn about the DBTool utility to fix the SQL width http://knowledgebase.progress.com/articles/Article/P24496
As far as the export is concerned what you are doing will already take care of the double quotes for character columns. You have a few options to solve your header issue depending on your version of Progress. This one will work no matter your version but is not as elegant as the newer options....
Basically, copy this into the procedure editor and it will generate a program with internal procedures for each table in your DB. Run csvdump.p by passing in the table name and the csv file you want (run csvdump.p ("mytable","myfile")).
Disclaimer: you may run into some odd datatypes that can't be exported, like RAW, but they aren't very common.
DEF VAR i AS INTEGER NO-UNDO.
OUTPUT TO csvdump.p.
PUT UNFORMATTED
"define input parameter ipTable as character no-undo." SKIP
"define input parameter ipFile as character no-undo." SKIP(1)
"OUTPUT TO VALUE(ipFile)." SKIP(1)
"RUN VALUE('ip_' + ipTable)." SKIP(1)
"OUTPUT CLOSE." SKIP(1).
FOR EACH _file WHERE _file._tbl-type = "T" NO-LOCK:
PUT UNFORMATTED "PROCEDURE ip_" _file._file-name ":" SKIP(1)
"EXPORT DELIMITER "~",~"" SKIP.
FOR EACH _field OF _File NO-LOCK BY _Field._Order:
IF _Field._Extent = 0 THEN
PUT UNFORMATTED "~"" _Field-Name "~"" SKIP.
ELSE DO i = 1 TO _Field._Extent:
PUT UNFORMATTED "~"" _Field-Name STRING(i,"999") "~"" SKIP.
END.
END.
PUT UNFORMATTED "." SKIP(1)
"FOR EACH " _File._File-name " NO-LOCK:" SKIP
" EXPORT DELIMITER "~",~" " _File._File-Name "." SKIP
"END." SKIP(1).
PUT UNFORMATTED "END PROCEDURE." SKIP(1).
END.
OUTPUT CLOSE.
BIG Disclaimer.... I don't have 9.1D to test with since it is well past the supported date.... I believe all of this will work though.
There are other ways to do this even in 9.1D (dynamic queries), but this will probably be easier for you to modify if needed since you are new to Progress. Plus it is likely to perform better than purely dynamic exports. You can keep nesting the REPLACE functions to get rid of more and more characters... or just copy the replace line and let it run over and over if needed.
DEF VAR i AS INTEGER NO-UNDO.
FUNCTION fn_Export RETURNS CHARACTER (INPUT ipExtent AS INTEGER):
IF _Field._Data-Type = "CHARACTER" THEN
PUT UNFORMATTED "fn_Trim(".
PUT UNFORMATTED _File._File-Name "." _Field._Field-Name.
IF ipExtent > 0 THEN
PUT UNFORMATTED "[" STRING(ipExtent) "]" SKIP.
IF _Field._Data-Type = "CHARACTER" THEN
PUT UNFORMATTED ")".
PUT UNFORMATTED SKIP.
END.
OUTPUT TO c:\temp\wks.p.
PUT UNFORMATTED
"define input parameter ipTable as character no-undo." SKIP
"define input parameter ipFile as character no-undo." SKIP(1)
"function fn_Trim returns character (input ipChar as character):" SKIP
" define variable cTemp as character no-undo." SKIP(1)
" if ipChar = '' or ipChar = ? then return ipChar." SKIP(1)
" cTemp = replace(replace(ipChar,CHR(13),''),CHR(11),'')." SKIP(1)
" return cTemp." SKIP(1)
"end." SKIP(1)
"OUTPUT TO VALUE(ipFile)." SKIP(1)
"RUN VALUE('ip_' + ipTable)." SKIP(1)
"OUTPUT CLOSE." SKIP(1).
FOR EACH _file WHERE _file._tbl-type = "T" NO-LOCK:
PUT UNFORMATTED "PROCEDURE ip_" _file._file-name ":" SKIP(1)
"EXPORT DELIMITER "~",~"" SKIP.
FOR EACH _field OF _File NO-LOCK BY _Field._Order:
IF _Field._Extent = 0 THEN
PUT UNFORMATTED "~"" _Field-Name "~"" SKIP.
ELSE DO i = 1 TO _Field._Extent:
PUT UNFORMATTED "~"" _Field-Name STRING(i) "~"" SKIP.
END.
END.
PUT UNFORMATTED "." SKIP(1)
"FOR EACH " _File._File-name " NO-LOCK:" SKIP.
PUT UNFORMATTED "EXPORT DELIMITER ~",~"" SKIP.
FOR EACH _field OF _File NO-LOCK BY _Field._Order:
IF _Field._Extent = 0 OR _Field._Extent = ? THEN
fn_Export(0).
ELSE DO i = 1 TO _Field._Extent:
fn_Export(i).
END.
END.
PUT UNFORMATTED "." SKIP(1)
"END." SKIP(1).
PUT UNFORMATTED "END PROCEDURE." SKIP(1).
END.
OUTPUT CLOSE.
I beg to differ on one small point with @TheMadDBA: using EXPORT will not quote all the fields in your output in CSV style. Logical fields, for example, will not be quoted.
'CSV format' is the vaguest of standards, but the EXPORT command does not conform to it; it was not designed for that. (I notice that in @TheMadDBA's final example, they do not use EXPORT, either.)
If you want all the non-numeric fields quoted, you need to handle this yourself.
def stream s.
output stream s to value(v-filename).
for each tablename no-lock:
put stream s unformatted
'"' tablename.charfield1 '"'
',' string(tablename.numfield)
',"' tablename.charfield2 '"'
skip.
end.
output stream s close.
In this example I'm assuming that you are okay with coding a specific dump for a single table, rather than a generic solution. You can certainly do the latter with meta-programming as in @TheMadDBA's answer, with ABL's dynamic query syntax, or even with -- may the gods forgive us both -- include files. But that's a more advanced topic, and you said you were just starting with ABL.
You will still have to deal with string truncation as per @TheMadDBA's answer.
After some inspiration from @TheMadDBA's answer and some additional thought, here is my solution to the problem...
I decided to write a script in R that would generate the p scripts. The R script uses one input, the table name and dumps out the p script.
Below is a sample p script...
DEFINE VAR columnNames AS CHARACTER.
columnNames = """" + "Company" + """" + "|" + """" + "ABCCode" + """" + "|" + """" + "MinDollarVolume" + """" + "|" + """" + "MinUnitCost" + """" + "|" + """" + "CountFreq" + """".
/* Define the temp-table */
DEFINE TEMP-TABLE tempTable
FIELD tCompany AS CHARACTER
FIELD tABCCode AS CHARACTER
FIELD tMinDollarVolume AS CHARACTER
FIELD tMinUnitCost AS CHARACTER
FIELD tCountFreq AS CHARACTER.
FOR EACH ABCCode NO-LOCK:
CREATE tempTable.
tempTable.tCompany = STRING(Company).
tempTable.tABCCode = STRING(ABCCode).
tempTable.tMinDollarVolume = STRING(MinDollarVolume).
tempTable.tMinUnitCost = STRING(MinUnitCost).
tempTable.tCountFreq = STRING(CountFreq).
END.
OUTPUT TO VALUE ("C:\Users\Admin\Desktop\ABCCode.csv").
/* Output the column names */
PUT UNFORMATTED columnNames.
PUT UNFORMATTED "" SKIP.
/* Output the temp-table */
FOR EACH tempTable NO-LOCK:
EXPORT DELIMITER "|" tempTable.
END.
OUTPUT CLOSE.
QUIT.
/* Done */
The R script makes an ODBC call to the DB to get the column names for the table of interest and then populates the template to generate the p script.
I'm not sure creating a temp table and casting everything as a character is the best way of solving the problem, but...
we have column names
everything is encapsulated in double quotes
and we can choose any delimiter (e.g. "|" instead of ",")
