Line feed leaving extra space at end of line in XQuery - xquery

I have an application where I create a .csv file and then create a .xlsx file from the .csv file. The problem I am having now is that the end of rows in the .csv have a trailing space. My database is in MarkLogic and I am using a custom REST endpoint to create this. The code where I create the .csv file is:
document{($header-row, for $each in $data-rows return fn:concat('&#13;&#10;', $each))}
I pass in the header row and then pass each data row with a carriage return and a line feed at the beginning. I do this to put the CR/LF at the end of the header and at the end of each of the other lines except for the last one. I know my data rows do not have the space at the end. I have tried normalizing the space around $each, and the extra space is still present. It seems to be tied to the line feed. Any suggestions on getting rid of that space?
Example of current output:
Name,Email,Phone
Bill,bill@company.com,999-999-9999
Steve,steve@company.com,999-999-9999
Bob,bob@company.com,999-999-9999
. . .

You are creating a document whose content is a text() node built from a sequence of strings. When those strings are combined to create a single text() node, they will be separated by a space. For instance, this:
text{("a","b","c")}
will produce this:
"a b c"
Change your code to use string-join() to join your $header-row and sequence of $data-rows string values with the &#13;&#10; separator:
document{ string-join(($header-row, $data-rows), "&#13;&#10;") }

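For reference, here is a small self-contained sketch of the fix with made-up sample rows (the values are placeholders, not the asker's data); run in a query console, it produces the rows separated by CR/LF with no stray space:
let $header-row := "Name,Email,Phone"
let $data-rows := ("Bill,bill@company.com,999-999-9999",
                   "Steve,steve@company.com,999-999-9999")
return document { string-join(($header-row, $data-rows), "&#13;&#10;") }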
Related

Error in loading a data to neo4j in case statement in cypher query

I am trying to upload CSV data with a CASE statement in the query, but the following error appears:
cypher:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///test.csv' as line
MATCH(a:test_t{tid:line.pid})
CASE
WHEN line.key !='NA' THEN
WITH split(line.key,",") as name
UNWIND name as x
MERGE(k:test_key{key_term:toLower(x)})
MERGE(a)-[:contains]->(k)
END
Error
Neo.ClientError.Statement.SyntaxError: Invalid input 'S': expected 'l/L' (line 5, column 3 (offset: 137))
"CASE"
Can anyone help me?
The CASE expression does not support embedding other Cypher clauses (but it can invoke functions). In fact, a CASE expression is not actually needed for your use case.
This query should work (the :auto at the beginning is needed in neo4j 4.0+):
:auto USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///test.csv' as line FIELDTERMINATOR ';'
WITH line
WHERE line.key <> 'NA'
MATCH (a:test_t {tid: line.pid})
UNWIND split(line.key, ',') as x
MERGE (k:test_key {key_term: TOLOWER(x)})
MERGE (a)-[:contains]->(k)
This query filters out all unwanted lines as soon as they are obtained from the file. Reducing the number of rows of data being worked on as early as possible is good practice.
Also, you have a second issue. Your data file cannot use the comma as both the (default) field terminator AND as the delimiter between your x values.
To resolve this ambiguity, the above query chose to use the FIELDTERMINATOR ';' option to specify that the ";" character will be used as the field terminator. A sample data file would look like this:
pid;key
123;NA
234;Foo,Bar
345;Bar,Baz
456;NA
567;Baz
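For completeness, here is a hedged sketch of how CASE could legitimately appear here, as an expression that produces a value rather than one wrapping other clauses; unwinding the empty list simply discards the 'NA' rows, though the WHERE-based query above remains the simpler approach:
LOAD CSV WITH HEADERS FROM 'file:///test.csv' AS line FIELDTERMINATOR ';'
MATCH (a:test_t {tid: line.pid})
WITH a, CASE WHEN line.key = 'NA' THEN [] ELSE split(line.key, ',') END AS names
UNWIND names AS x
MERGE (k:test_key {key_term: toLower(x)})
MERGE (a)-[:contains]->(k)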
You are using CASE incorrectly. You cannot have update clauses inside of a CASE expression. Instead, you can use a WHERE clause to filter the rows of the file. For instance, adding WHERE line.key <> 'NA' while processing the file before you move on to the update will work. Something like this should fit the bill.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///test.csv' as line
MATCH (a:test_t {tid: line.pid})
WITH line
WHERE line.key <> 'NA'
WITH split(line.key, ",") as name
UNWIND name as x
MERGE (k:test_key {key_term: toLower(x)})
MERGE (a)-[:contains]->(k)
It looks like, from your logic, you could even move the test up above the MATCH. So this might be better (fewer unnecessary matches).
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///test.csv' as line
WITH line
WHERE line.key <> 'NA'
MATCH (a:test_t {tid: line.pid})
WITH split(line.key, ",") as name
UNWIND name as x
MERGE (k:test_key {key_term: toLower(x)})
MERGE (a)-[:contains]->(k)

Xquery preserve spaces while tokenizing

I am trying to achieve the following using XQuery.
Input
<DemoXML>
This is a sample line one
this is line number two
this line contains multiple spaces
paragraph ends
</DemoXML>
Required Output (Two Records)
<Record1>
This is a sample line one
this line contains multiple spaces
paragraph ends
</Record1>
<Record2>
This is a sample line one
this line contains multiple spaces
paragraph ends
</Record2>
I tried using tokenize, but the problem is that the tokenize function removes all the spaces in the second line.
this is line number two
fn:tokenize($input,'\n')
Tokenize Output
This is a sample line one
this is line number two
this line contains multiple spaces
paragraph ends
Can someone let me know a workaround, please?
Your attached query is working fine. I have also attached the generated output for your reference. Maybe the issue is in the processor you are using. I tested this query in the MarkLogic console and in the Oxygen editor with XQuery 9.6.0.7.
let $val1 :=
"This is a sample line one
this is line number two
this line contains multiple spaces
paragraph ends"
return tokenize($val1, '\n')
Generated output:
This is a sample line one this is line number two this line contains multiple spaces paragraph ends
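As a hedged, standalone illustration (the element and record names are taken loosely from the question, and the sample text is abridged), tokenize on '\n' keeps the interior spaces of each line, and each token can then be wrapped in its own element:
let $input :=
  <DemoXML>This is a sample line one
this is    line number two
this line contains    multiple spaces
paragraph ends</DemoXML>
for $line in fn:tokenize($input, '\n')
return <Record>{$line}</Record>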

Dealing with quotation marks in a quote-surrounded string

Take this CSV file:
ID,NAME,VALUE
1,Blah,100
2,"Has space",200
3,"Ends with quotes"",300
4,""Surrounded with quotes"",300
It loads just fine in most statistical programs (R, SAS, etc.) but in Excel the third row is misinterpreted because it has two quotation marks. Escaping the last quote as \" will also not work in Excel. The only way I have found so far is to replace the one double quote with two double quotes:
ID,NAME,VALUE
1,Blah,100
2,"Has space",200
3,"Ends with quotes""",300
4,"""Surrounded with quotes""",300
But that would render the file completely useless for all other programs (R, SAS, etc.)
Is there a way to format the CSV file where strings can begin or end with the same characters as that used to surround them, such that it would work in Excel as well as commonly used statistical software?
Your second representation is the normal way to generate a CSV file, so it should be easy to work with in any software. See the RFC 4180 specification: https://www.ietf.org/rfc/rfc4180.txt
So your second example represents this data:
Obs id name value
1 1 Blah 100
2 2 Has space 200
3 3 Ends with quotes" 300
4 4 "Surrounded with quotes" 300
If you want to represent it as a delimited file where none of the values are allowed to contain the delimiter (in other words, NOT as a standard CSV file), then it would look like:
id,name,value
1,Blah,100
2,Has space,200
3,Ends with quotes",300
4,"Surrounded with quotes",300
But if you want to allow the values to contain the delimiter, then you need some way to distinguish embedded delimiters from real delimiters. So the standard forces values that contain the delimiter to be quoted. But once you do that, you also need to add quotes around fields that contain the quote character itself (and double the embedded quotes) to avoid producing an ambiguous file. For example, the quotes in the 4th observation in your first file look like optional quotes around a value rather than part of the value.
Many programs try to handle ambiguous situations. For example, SAS does not allow values to contain embedded line breaks, so you will always get four observations with your first example file.
But Excel allows the end-of-line character(s) to be embedded inside quoted values. So in your original file, the value of the second field in the third observation looks like what you would get if you added quotes around this value:
Ends with quotes",300
4,"Surrounded with quotes",300
So instead of 4 complete observations with three field values each, there are only three observations, and the last observation has only two field values.
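As a quick, hedged check of that claim in base R (the file name data_fixed.csv is a placeholder for the second, RFC 4180-style example saved to disk), no special handling is needed:
# Read the standard (doubled-quote) CSV; read.csv handles "" inside quoted fields
df <- read.csv("data_fixed.csv", stringsAsFactors = FALSE)
df$NAME
# [1] "Blah"  "Has space"  "Ends with quotes\""  "\"Surrounded with quotes\""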
This is caused by the fact that the escape character for " in Excel is ""; see: Escaping quotes and delimiters in CSV files with Excel
A quick and simple workaround that comes to mind in R is to first read the content of the CSV with readLines, then replace the doubled (escaped) double quotes with a single double quote, and then use read.table:
read.table(
  text = gsub(pattern = "\"\"", "\"", readLines("data.csv")),
  sep = ",",
  header = TRUE
)

Want a commandline to get the data as it is as we get when we export data

I have data with some thousands of records, and each record has multiple columns. One of the columns contains data that includes a comma.
When I tried to spool that data into a .csv file and then split the text into columns using the comma as the delimiter, the result was wrong because the data itself has a comma in it.
I am looking for a way to export the data from the command line so that it looks the same as when I export the data via TOAD.
Any help is much appreciated.
Note: I have been looking for a solution for many days, but only now got a chance to post it here.
When exporting the dataset in Toad, select a delimiter other than a comma or drop down the "string quoting" dropdown box and select "double quote strings including NULLS".
Oh wait, if you are spooling output, you'll need to add the double quotes in your SELECT statement like this, in order to surround the columns containing the comma with double quotes:
select '"' || column || '"' as column from table;
This format is pretty standard, but you could use pipes as delimiters instead and save space by not having to wrap strings in double quotes. It really depends on what the consumer of the data requires.
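If the command-line export is done by spooling from SQL*Plus, a minimal hedged sketch might look like the following (my_table, id, name, and amount are placeholder names, not from the original question):
set heading off
set feedback off
set pagesize 0
set trimspool on
spool mydata.csv
select id || ',"' || name || '",' || amount from my_table;
spool off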

Delete a chosen line from a text file using ASP?

I have a loop that finds duplicate lines in a .ini file. I can happily find the duplicate lines, and write new lines to the file using FileSystemObject, however... I can't seem to find out how to delete the duplicate lines. What I want to do is delete the lines by line number as I have identified the relevant line number already.
Is there a native way to do this, or is it a case of rewriting the file minus the duplicate lines?
Any help is greatly appreciated.
Thanks.
My method of finding the duplicate entry is as follows:
Do While Not file.AtEndOfStream
intLineNumber = intLineNumber + 1
strReadLineText = file.ReadLine
If strSearchText <> "" And InStr(strReadLineText, strSearchText) > 0 Then
session("message") = "Line Exists on " + Cstr(intLineNumber)
'' # delete duplicate line...
End If
Loop
file.Close()
You can see from the placement of my comment where I wish to remove the line that is found.
There are no built-in methods to do this in VBScript (or the intrinsic objects).
My vote would be to use string concatenation to accomplish this; i.e. instead of deleting lines, build the new file contents line by line in a string, appending a line only if it isn't already present in the string you're building, and then write that string back to the file.
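A rough, hedged sketch of that approach in classic ASP/VBScript (strFilePath is a placeholder, and the duplicate test is simplified): the file is rebuilt in a string, keeping only the first occurrence of each line, and then written back out.
Dim fso, inFile, outFile, strLine, strOutput
Set fso = Server.CreateObject("Scripting.FileSystemObject")
Set inFile = fso.OpenTextFile(strFilePath, 1) ' 1 = ForReading
strOutput = ""
Do While Not inFile.AtEndOfStream
    strLine = inFile.ReadLine
    ' Prefix with vbCrLf so only whole-line matches count as duplicates
    If InStr(vbCrLf & strOutput, vbCrLf & strLine & vbCrLf) = 0 Then
        strOutput = strOutput & strLine & vbCrLf
    End If
Loop
inFile.Close
Set outFile = fso.OpenTextFile(strFilePath, 2, True) ' 2 = ForWriting, create if missing
outFile.Write strOutput
outFile.Close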
