How to extend a sprache parser to cope with leading and trailing free text - sprache

I have a sprache parser that successfully recognizes a variety of complex strings.
I now have to find these strings if they are embedded in free text. Is this possible?
For example, "FJ21 [7-20]" and "7.2x1.2 FULL" are examples of strings that my parser can match.
I need to be able to find them within text such as:
"The quick brown FJ21 [7-20] jumps over the lazy 7.2x1.2 FULL"

Related

Are there 2 types of double-quotes in R? My double quotes look slanted and are giving an error message

I am using the statement in R:
setwd("C:\\Users\\carl\\Documents\\research")
to set the working directory. It worked fine when I pasted the statement from someone else's R script but I received an error message:
Error: unexpected input in "setwd("".
when I entered the command directly or when I copied it from my script in a Word file.
It seems to be related to the fact that the double-quotes that I typed (that don't work) look a little slanted while the double-quotes in the pasted text (that work fine) look like they're straight up and down. Is there something I can do to type plain looking double-quotes instead of slanted double-quotes?
Word automatically replaces your double quotes with so-called smart quotes or curly quotes.
You need to use the regular/straight double quotes (") in R.
This support article explains how you can disable the automatic smart quote replacement in Word. In fairness though, Word is probably not the... um... ideal code editor.
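If a command has already been typed or pasted with curly quotes, one possible workaround is to treat it as text and swap the quotes back. This is only a minimal sketch: it assumes the curly quotes are U+201C and U+201D (the pair Word normally inserts), writes the path with forward slashes for brevity, and uses illustrative variable names.

```r
# Text as it might arrive from Word, with curly double quotes (U+201C, U+201D).
pasted <- "setwd(\u201cC:/Users/carl/Documents/research\u201d)"

# Replace both curly variants with the straight double quote that R expects.
fixed <- gsub("[\u201c\u201d]", "\"", pasted)

cat(fixed, "\n")
```

Typing the code in a plain text editor avoids the problem in the first place.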

Search up a line in a paragraph

I'd like to extract segments of line from a text.
For example:
txt<-"This is some cool text that involves this type of text and not that kind."
extract.context(txt,start="of text",end="that")
"of text and not that"
It kind of depends on what exactly you will be looking for. If you will just be searching for characters (no punctuation), then this will work nicely.
extract.context<-function(txt, start, end) {
sapply(regmatches(txt, gregexpr(paste0(start,".*",end),txt)), "[", 1)
}
txt<-"This is some cool text that involves this type of text and not that kind."
extract.context(txt,start="of text",end="that")
# [1] "of text and not that"
This method uses a basic regular expression, so if you search for characters that have a special meaning in regular expression syntax, it could get confused. It's also unclear what you want to do should multiple matches occur. Right now I just return the first. But since you didn't provide a lot of context, I'm going to assume that's OK.
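Both caveats can be worked around, at least roughly. The sketch below is not the function from the answer above but a hypothetical variant, extract_all_contexts(): it wraps start and end in \Q...\E so PCRE treats them literally, uses a non-greedy .*? so each match stops at the nearest end, and returns every match. It assumes txt is a single string and that neither marker contains the sequence \E.

```r
# Hypothetical variant: literal start/end markers, all non-greedy matches.
extract_all_contexts <- function(txt, start, end) {
  # \Q...\E quotes the markers so characters like "(" are matched literally.
  pattern <- paste0("\\Q", start, "\\E.*?\\Q", end, "\\E")
  # perl = TRUE is needed for \Q...\E and the lazy quantifier.
  regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
}

txt <- "Cost (net) is 5 USD. Cost (gross) is 6 USD."
found <- extract_all_contexts(txt, start = "Cost (", end = ")")
print(found)
```

With the greedy .* of the original, the same call would return one long match running from the first "Cost (" to the last ")".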

Which function should I use to read unstructured text file into R? [closed]

Closed 9 years ago.
This is my first ever question here and I'm new to R, trying to figure out my first step in how to do data processing, please keep it easy : )
I'm wondering what would be the best function and a useful data structure in R to load unstructured text data for further processing. For example, let's say I have a book stored as a text file, with no new line characters in it.
Is it a good idea to use read.delim() and store the data in a list? Or is a character vector better, and how would I define it?
Thank you in advance.
PN
P.S. If I use "." as my delimiter, it would treat things like "Mr." as a separate sentence. While this is just an example and I'm not concerned about this flaw, just for educational purposes, I'd still be curious how you'd get around this problem.
read.delim reads in data in table format (with rows and columns, as in Excel). It is not very useful for reading a string of text.
To read text from a text file into R you can use readLines(). readLines() creates a character vector with as many elements as lines of text. A line, for this kind of software, is any string of text that ends with a newline. (Read about newline on Wikipedia.) When you write text, you enter your system-specific newline character(s) by pressing Return. A line of text is therefore not defined by the width of your software window, but can run over many visual rows. In effect, a line of text is what in a book would be a paragraph. So readLines() splits your text at the paragraphs:
> readLines("/path/to/tom_sawyer.txt")
[1] "\"TOM!\""
[2] "No answer."
[3] "\"TOM!\""
[4] "No answer."
[5] "\"What's gone with that boy, I wonder? You TOM!\""
[6] "No answer."
[7] "The old lady pulled her spectacles down and looked over them about the room; then she put them up and looked out under them. She seldom or never looked through them for so small a thing as a boy; they were her state pair, the pride of her heart, and were built for \"style,\" not service—she could have seen through a pair of stove-lids just as well. She looked perplexed for a moment, and then said, not fiercely, but still loud enough for the furniture to hear:"
[8] "\"Well, I lay if I get hold of you I'll—\""
Note that you can scroll long text to the left here in Stackoverflow. That seventh line is longer than this column is wide.
As you can see, readLines() read that long seventh paragraph as one line. And, as you can also see, the printed output shows a backslash in front of each quotation mark. Since R displays the individual lines in quotation marks, it needs to distinguish these from those that are part of the original text. Therefore, it "escapes" the original quotation marks when printing; the backslashes are not stored in the text itself. Read about escaping on Wikipedia.
readLines() may output a warning that an "incomplete final line" was found in your file. This only means that there was no newline after the last line. You can suppress this warning with readLines(..., warn = FALSE), but you don't have to: it is not an error, and suppressing the warning does nothing but hide the warning message.
If you don't want to just output your text to the R console but process it further, create an object that holds the output of readLines():
mytext <- readLines("textfile.txt")
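To try this without a real book at hand, here is a small self-contained sketch. It writes two made-up paragraphs to a temporary file (the file and the variable names are illustrative only), reads them back with readLines(), and then collapses them into a single string in case you prefer the whole text as one element.

```r
# Create a throwaway text file with two "paragraphs" (one per line).
path <- tempfile(fileext = ".txt")
writeLines(c("First paragraph.", "Second paragraph."), path)

# One element per line of the file.
paragraphs <- readLines(path, warn = FALSE)
print(length(paragraphs))

# Collapse into a single string, keeping the paragraph breaks as "\n".
whole_text <- paste(paragraphs, collapse = "\n")
print(nchar(whole_text))

# Remove the temporary file again.
unlink(path)
```

Whether a vector of paragraphs or one long string is more convenient depends on what you plan to do next; both are ordinary character vectors.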
Besides readLines(), you can also use scan(), readBin() and other functions to read text from files. Look at the manual by entering ?scan etc. Look at ?connections to learn about many different methods to read files into R.
I would strongly advise you to write your text in a .txt-file in a text editor like Vim, Notepad, TextWrangler etc., and not compose it in a word processor like MS Word. Word files contain more than the text you see on screen or printed, and those will be read by R. You can try and see what you get, but for good results you should either save your file as a .txt-file from Word or compose it in a text editor.
You can also copy-paste your text from a text file open in any other software to R or compose your text in the R console:
> myothertext <- c("What did you do?
+ I wrote some text.
+ Ah, interesting.")
> myothertext
[1] "What did you do?\nI wrote some text.\nAh, interesting."
Note how entering Return does not cause R to execute the command before I closed the string with "). R just replies with +, telling me that I can continue to edit. I did not type in those plusses. Try it. Note also that now the newlines are part of your string of text. (I'm on a Mac, so my newline is \n.)
If you input your text manually, I would load the whole text as one string into a vector:
x <- c("The text of your book.")
You could load different chapters into different elements of this vector:
y <- c("Chapter 1", "Chapter 2")
For better reference, you can name the elements:
z <- c(ch1 = "This is the text of the first chapter. It is not long! Why was the author so lazy?", ch2 = "This is the text of the second chapter. It is even shorter.")
Now you can split the elements of any of these vectors:
sentences <- strsplit(z, "[.!?] *")
Enter ?strsplit to read the manual for this function and learn about the arguments it takes. The second argument takes a regular expression. In this case I told strsplit to split the elements of the vector at any of the three punctuation marks followed by an optional space (if you don't define a space here, the resulting "sentences" will be preceded by a space).
sentences now contains:
> sentences
$ch1
[1] "This is the text of the first chapter" "It is not long"
[3] "Why was the author so lazy"
$ch2
[1] "This is the text of the second chapter" "It is even shorter"
You can access the individual sentences by indexing:
> sentences$ch1[2]
[1] "It is not long"
R will be unable to know that it should not split after "Mr.". You must define exceptions in your regular expression. Explaining this is beyond the scope of this question.
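For the curious, one rough way to express such exceptions is a negative lookbehind, which needs perl = TRUE in strsplit(). This is only a sketch under narrow assumptions: the abbreviation list (Mr, Mrs, Dr) is an example I picked, not a complete set, and the sample text is invented.

```r
# Invented sample text containing abbreviations that should not end a sentence.
txt <- "Mr. Smith arrived late. Was Dr. Jones angry? No! He smiled."

# Split at . ! or ? plus optional spaces, unless the punctuation directly
# follows one of the listed abbreviations (checked by negative lookbehinds).
pattern <- "(?<!Mr)(?<!Mrs)(?<!Dr)[.!?] *"

sentences <- strsplit(txt, pattern, perl = TRUE)[[1]]
print(sentences)
```

Every further abbreviation needs its own lookbehind, so this does not scale to real books, but it shows the idea.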
How you would tell R how to recognize subjects or objects, I have no idea.

HTML and XML Parsing in Fortran

I am studying mathematical computation and I am completely stuck on this task! I don't even know how to go about starting it!
Write a program in Fortran that can parse a single line of well-formed HTML or XML markup. It takes input on a single line (guaranteed not to exceed 80 characters in total) like
<tag>lots of lovely text</tag>
where
tag might be anything from 1 to 37 ASCII characters and will not contain spaces
text could contain spaces and be anything from 1 to 73 characters in length
The program outputs one of two lines:
tag : text if the two occurrences of tag match inside <...>, and
syntax error if anything else is input.
Any help is hugely appreciated!
There are a number of intrinsic functions for working with strings that may be helpful.
result = index(string, substring) - returns the position of the start of the first occurrence of substring within string, counting from one. (Fortran 77)
result = scan(string, set) - scans a string for any of the characters in a set of characters. (Fortran 95)
result = verify(string, set) - verifies that all the characters in a string are present in a set. (Fortran 95)
There are a few user-contributed string tokenization functions on the Fortran Wiki that might be helpful:
delim, strtok, and find_field. Also, FLIBS includes some string manipulation and tokenization routines that might be useful as examples.
Finally, there are a number of existing open-source XML parsers written in Fortran: xmlf90 and xml-fortran. Looking at the source code for these libraries should be helpful.

IDML : What are Kinsoku/Mojikumi tables?

I am new to the world of Adobe InDesign and IDML file format. I am trying to understand the IDML file format so that I can create IDML files dynamically through code!
I am going through the IDML File format specification and have found references to "Mojikumi Tables" and "Kinsoku Tables" and "Aki". Though the documentation defines various attributes for these elements, there's no clear explanation what these elements actually are.
Any pointers or links to relevant articles would be really helpful.
Thanks.
These are all additional typography settings used in laying out Japanese text.
Kinsoku: A rule set in the Japanese language that is used to determine characters that are not permitted at the beginning or end of a line. Reference.
Mojikumi: Determines spacing between punctuation, symbols, numbers, and other character classes in Japanese type. Reference.
Aki: Means space in Japanese:
"When the glyphs that correspond to characters of different character
classes come together in a run of text, there is spacing behaviour. In
other words, extra space, measured using a fraction of an em, is
introduced depending on which two character classes are in proximity*.
Typical values are one-fourth and one-half of an em"
(Footnote: * 'In Japanese this space is referred to as aki, which simply means
"space"')
Reference and source for this quote.
Here's a link to a book that should provide more information: CJKV Information Processing, 2nd Edition
