XQuery: preserve spaces while tokenizing - xquery

I am trying to achieve the below using XQuery
Input
<DemoXML>
This is a sample line one
this is line number two
this line contains multiple spaces
paragraph ends
</DemoXML>
Required Output(Two Records)
<Record1>
This is a sample line one
this line contains multiple spaces
paragraph ends
</Record1>
<Record2>
This is a sample line one
this line contains multiple spaces
paragraph ends
</Record2>
I tried using tokenize, but the problem is that the tokenize function removes all the spaces in the second line:
this is line number two
fn:tokenize($input,'\n')
Tokenize Output
This is a sample line one
this is line number two
this line contains multiple spaces
paragraph ends
Can someone let me know a workaround, please?

Your attached query is working fine; I have also attached the generated output for your reference. It may be an issue in the processor you are using. I tested this query in the MarkLogic console and in Oxygen Editor with XQuery 9.6.0.7.
let $val1 :=
"This is a sample line one
this is line number two
this line contains multiple spaces
paragraph ends"
return tokenize($val1, '\n')
Generated output:
This is a sample line one this is line number two this line contains multiple spaces paragraph ends
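For comparison, here is a minimal Python sketch of the same idea: splitting on the newline character alone leaves the spaces inside each line untouched, which is consistent with the answer's observation that the query itself is fine. The sample string and the helper name split_lines are illustrative assumptions, not part of the original query.

```python
def split_lines(text):
    """Split on the newline character only; spaces inside each line are kept."""
    return text.split("\n")


# Hypothetical sample with runs of spaces in the second line.
sample = "This is a sample line one\nthis  is   line number two\nparagraph ends"

lines = split_lines(sample)
print(lines[1])
```

If the runs of spaces still disappear in a real pipeline, it may be worth checking whether something before or after the split (for example, how the input is read or how the output is displayed) collapses whitespace.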

Related

Line feed leaving extra space at end of line in XQuery

I have an application where I create a .csv file and then create a .xlsx file from the .csv file. The problem I am having now is that the end of rows in the .csv have a trailing space. My database is in MarkLogic and I am using a custom REST endpoint to create this. The code where I create the .csv file is:
document{($header-row,for $each in $data-rows return fn:concat('&#13;&#10;',$each))}
I pass in the header row then pass each data row with a carriage return and a line feed at the beginning. I do this to put the cr/lf at the end of the header and then each of the other lines except for the last one. I know my data rows do not have the space at the end. I have tried to normalize the space around $each and the extra space is still present. It seems to be tied to the line feed. Any suggestions on getting rid of that space?
Example of current output:
Name,Email,Phone
Bill,bill@company.com,999-999-9999
Steve,steve@company.com,999-999-9999
Bob,bob@company.com,999-999-9999
. . .
You are creating a document that is a text() node built from a sequence of strings. When those are combined to create a single text(), they will be separated by a space. For instance, this:
text{("a","b","c")}
will produce this:
"a b c"
Change your code to use string-join() to join your $header-row and sequence of $data-rows string values with the &#13;&#10; separator:
document{ string-join(($header-row, $data-rows), "&#13;&#10;") }

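The difference between the two approaches can be sketched in Python, as a rough analogue only. Python does not insert spaces implicitly, so the space join below is written out explicitly to mimic what the answer describes for a text node built from a sequence of strings. The variable names and sample rows are hypothetical.

```python
# Hypothetical sample data, not taken from the original question.
header_row = "Name,Phone"
data_rows = ["Bill,555-0100", "Steve,555-0101"]

# Analogue of the original approach: each data row is prefixed with CR/LF and
# the pieces are then joined with a single space (the implicit separator).
space_joined = " ".join([header_row] + ["\r\n" + row for row in data_rows])

# Analogue of the string-join() fix: CR/LF is the only separator.
crlf_joined = "\r\n".join([header_row] + data_rows)

print(repr(space_joined))
print(repr(crlf_joined))
```

In the first result every line except the last ends with a space just before the CR/LF, which matches the symptom in the question; in the second there is no stray space.
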
How do I extract a section number and the text after it?

I have a question.
My text file contains lines such as:
1.1        Description.
This is the description.
1.1.1      Quality Assurance
Random sentence.
1.6.1    Quality Control. Quality Control is the responsibility of the contractor.
I'm trying to find out how to get:
1.1        Description
1.1.1      Quality Assurance
1.6.1    Quality Control
Right now, I have:
txt1 <- readLines("text1.txt")
txt2<-grep("^[0-9.]+", txt1, value = TRUE)
file<-write(txt2, "text3.txt")
which results in:
1.1        Description.
1.1.1      Quality Assurance
1.6.1    Quality Control. Quality Control is the responsibility of the contractor.
You are using grep with value=TRUE, which
returns a character vector containing the selected elements of x
(after coercion, preserving names but no other attributes).
This means that if your regular expression matches anything in the line, the whole line will be returned. Your regular expression matches numbers at the beginning of the line, so every line that begins with numbers gets selected.
It seems that your goal is not to select the whole line, but only the part up to a line break or a period.
So you need to make the regular expression more specific, and you need to extract only the matching portion of the line.
A regular expression that matches what you want can be:
"^([0-9]\\.?)+ .+?(\\.|$)"
It matches numbers with dots, followed by a space, followed by anything, and stops matching at the first . or at the end of the line. I recommend the following website to better understand what the regex does: https://regexr.com/
The next step is to extract only the matching portion of each line, not the whole line in which the regex has a match. For this we'll use the function regexpr, which tells us where the matches are, and the function regmatches, which helps us extract those matches:
txt1 <- readLines("text.txt")
regmatches(txt1, regexpr("^([0-9]\\.?)+ .+?(\\.|$)", txt1))
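The same match-then-extract idea can be sketched in Python. This is a variation on the pattern above rather than a literal translation: a capture group is added so that the terminating period is left out of the result, and the function name extract_headings and the sample lines are assumptions made for illustration.

```python
import re

# Section number (digits with optional dots), whitespace, then as little text
# as possible up to the first period or the end of the line. The period is
# outside the capture group, so it is not part of the extracted heading.
SECTION_RE = re.compile(r"^((?:[0-9]+\.?)+\s+.+?)(?:\.|$)")


def extract_headings(lines):
    """Return only the matched portion of each line that starts with a section number."""
    headings = []
    for line in lines:
        match = SECTION_RE.match(line)
        if match:
            headings.append(match.group(1))
    return headings


sample_lines = [
    "1.1        Description.",
    "This is the description.",
    "1.1.1      Quality Assurance",
    "Random sentence.",
    "1.6.1    Quality Control. Quality Control is the responsibility of the contractor.",
]

for heading in extract_headings(sample_lines):
    print(heading)
```

Lines that do not start with a digit are skipped, and the spacing between the number and the title is kept as it appears in the input.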

How do I match a keyword from a split string to a text file - Python

I want to take the user's input, split it into separate words, and match a specific keyword from the input against a text file. When a keyword is matched in the text file, it should print the line it is on.
Replace your line of code "if problem in line:" with:
if len(list(set(problem) & set(line.split())))>0:
Explanation, as requested:
1) line.split() turns a line of text into a list of words.
2) set(list1) & set(list2) produces the intersection of the two lists.
3) If the length of the intersection is 0, the two lists have nothing in common.
Hope this helps.
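A self-contained sketch of the set-intersection approach follows. The question's own code is not shown, so the function name matching_lines, the lower-casing step, and the sample lines are assumptions; in a real script the lines would come from the text file.

```python
def matching_lines(user_input, lines):
    """Return the lines that share at least one whole word with the user's input."""
    keywords = set(user_input.lower().split())
    matches = []
    for line in lines:
        # A non-empty intersection means at least one keyword occurs in the line.
        if keywords & set(line.lower().split()):
            matches.append(line)
    return matches


# Hypothetical stand-in for the contents of the text file.
sample_lines = ["my screen is cracked", "battery drains fast"]

for line in matching_lines("the Screen broke", sample_lines):
    print(line)
```

Note that split() only separates on whitespace, so a word followed by punctuation (such as "screen.") would not match "screen" unless the punctuation is stripped first.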

Split text file into paragraph files in R

I'm trying to split a huge .txt file into multiple .txt files containing just one paragraph each.
Let me provide an example. I would need a text like this:
This is the first paragraph. It makes no sense because is just an example.
This a second paragraph, as meaningless as the previous one.
Saved as two independent .txt files containing the first paragraph (the first file) and the second paragraph (the second file).
The first file would have only: "This is the first paragraph. It makes no sense because is just an example."
And the second one: "This a second paragraph, as meaningless as the previous one."
And the same for the whole text. In the huge .txt file paragraphs are divided by one or several empty lines. Ideas?
Thank you very much!
I created a 3 paragraph example and am using your comment here to recreate what I think you're describing.
text <- "This is the first paragraph. It makes no sense because is just an example. Nothing makes sense and I'm trying to understand what I'm doing with life. This paragraph does not seem to end.
What are we doing here.
This a second paragraph, as meaningless as the previous one.
There's too much to do - this is meaningless though.
Wow, that's funny."
paras <- unlist(strsplit(text, "\n{2,}"))
for (i in seq_along(paras)) {
  writeLines(paras[i], paste0("paragraph", i, ".txt"))
}
This code first assigns the sample to the variable text, then uses the strsplit function with the pattern "\n{2,}" to split the text wherever two or more newline characters occur in a row, that is, at one or several empty lines.
Then a for loop goes through each element and saves it into a separate .txt file with writeLines, which writes only the paragraph text (write.table would also add a column header and quotes).
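For readers who prefer Python, the same split-then-write idea can be sketched as follows. This is an illustration under assumptions: the function names, the output file naming, and the choice to treat lines containing only whitespace as empty are mine, not the answer's.

```python
import re
from pathlib import Path


def split_paragraphs(text):
    """Split text into paragraphs separated by one or more blank lines."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [part.strip() for part in parts if part.strip()]


def write_paragraphs(text, out_dir):
    """Write each paragraph to paragraph1.txt, paragraph2.txt, ... in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, para in enumerate(split_paragraphs(text), start=1):
        path = out / f"paragraph{i}.txt"
        path.write_text(para + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

A call such as write_paragraphs(Path("huge.txt").read_text(encoding="utf-8"), "paragraphs") would then produce one file per paragraph; both the input file name and the output directory here are placeholders.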

How to escape new line in man pages

I have refactored a man page's paragraph so that each sentence is on its own line. When rendering with man ./somefile.3, the output is slightly different.
Let me show an example:
This is line 1. This is line 2.
vs.
This is line 1.
This is line 2.
Are rendering like so:
First:
This is line 1. This is line 2.
Second:
This is line 1. This is line 2.
There is an extra space between the sentences. Note that I have made sure that there is no extra whitespace. I have more experience with LaTeX, AsciiDoc, and Markdown, where I can control this; is it possible with troff/groff? I'd like to avoid the extra space if possible. I don't think it should be there.
The troff input standard is to have a newline at the end of each sentence, and to let the typesetter do its job with filling. (Although I doubt it was the intent, it does make it play nicer with source control.) Therefore, it considers sentence ends to be at the end of a line that ends with a period (or ? or !, and optionally followed by ',",*,],),or †). It also believes that sentences should have two spaces between them. This almost certainly derives from the typography standards at Bell Labs at the time; it's rather curious that this behavior is not settable through any fill modes.
groff does provide a way to set the "inter-sentence" spacing, with the extended .ss request:
.ss word_space_size [sentence_space_size]
Change the size of a space between words. It takes its units as one
twelfth of the space width parameter for the current font. Initially
both the word_space_size and sentence_space_size are 12. In fill mode,
the values specify the minimum distance.
If two arguments are given to the ss request, the second argument sets
the sentence space size. If the second argument is not given, sentence
space size is set to word_space_size. The sentence space size is used
in two circumstances: If the end of a sentence occurs at the end of a
line in fill mode, then both an inter-word space and a sentence space
are added; if two spaces follow the end of a sentence in the middle of
a line, then the second space is a sentence space. If a second
argument is never given to the ss request, the behaviour of UNIX troff
is the same as that exhibited by GNU troff. In GNU troff, as in UNIX
troff, a sentence should always be followed by either a newline or two
spaces.
So you can specify that the "sentence space" should be zero-width by making the request
.ss 12 0
As far as I know, this is a groff extension; Heirloom troff supports it, but older DWB-derived versions may not.
Example:
This is line 1. This is line 2.
This is line 1. This is line 2.
This is line 1.
This is line 2.
SET SENTENCE SPACING
.ss 12 0
This is line 1. This is line 2.
This is line 1. This is line 2.
This is line 1.
This is line 2.
Results:
$ groff -T ascii spaces.tr |sed -n -e/./p
This is line 1. This is line 2.
This is line 1. This is line 2.
This is line 1. This is line 2.
SET SENTENCE SPACING
This is line 1. This is line 2.
This is line 1. This is line 2.
This is line 1. This is line 2.
So the following will work, but I hope there is a better option.
This is line 1. \
This is line 2.
renders as
This is line 1. This is line 2.

Resources