DATA
mystring1 <- "Other work has shown that, in addition to language-general features such as a decreased speaking rate and an expanded pitch range, clear speech production involves the enhancement of the acoustic-phonetic distance between phonologically contrastive categories e.g., Ferguson and Kewley-Port, 2002; Krause and Braida, 2004, Picheny et al, 1986; Smiljanic and Bradlow, 2005, 2007."
mystring2 <- "Other work has shown that, in addition to language-general features such as a decreased speaking rate and an expanded pitch range, clear speech production involves the enhancement of the acoustic-phonetic distance between phonologically contrastive categories e.g., Ferguson and Kewley-Port, 2002; Krause and Braida, 2004, Picheny et al, 1986; Smiljanic and Bradlow, 2005, 2007. Therefore, reduced sensitivity to any or all of the language-specific acoustic-phonetic dimensions of contrast and clear speech enhancement would yield a diminished clear speech benefit for non-native listeners. This may appear somewhat surprising given that clear speech production was elicited in our studies by instructing the talkers to speak clearly for the sake of listeners with either a hearing impairment or from a different native language background. However, as discussed further in Bradlow and Bent 2002, the limits of clear speech as a means of enhancing non-native speech perception likely reflect the “mistuning” that characterizes spoken language communication between native and non-native speakers."
I'd like some help with a regular expression. I have some text data. Basically, I want to remove the citation parts that appear between the last word of a sentence and the period. However, the parentheses around the citations are somehow missing; mystring1 is an example of that. In this example, I want to remove "e.g., Ferguson and Kewley-Port, 2002; Krause and Braida, 2004, Picheny et al, 1986; Smiljanic and Bradlow, 2005, 2007". But this sentence is just one of the sentences in a paragraph; mystring2 contains three more sentences following mystring1. My goal is to remove the citation part from mystring2, but I have not been successful: the pattern removes more text than I want. How can I revise the regex pattern? Thank you for your help in advance.
# This works for mystring1.
gsub(x = mystring1, pattern = "e\\.g\\.,.*[0-9]{4}(?=.)", replacement = "", perl = T)
[1] "Other work has shown that, in addition to language-general features such as a
decreased speaking rate and an expanded pitch range, clear speech production involves
the enhancement of the acoustic-phonetic distance between phonologically contrastive
categories ."
# But this pattern does not work for mystring2; gsub() removes more text than I want.
gsub(x = mystring2, pattern = "e\\.g\\.,.*[0-9]{4}(?=.)", replacement = "", perl = T)
[1] "Other work has shown that, in addition to language-general features such as a decreased
speaking rate and an expanded pitch range, clear speech production involves the
enhancement of the acoustic-phonetic distance between phonologically contrastive
categories , the limits of clear speech ... (I trimmed texts here) speakers."
I suggest using
\be\.g\.,.*?[0-9]{4}[^\w.]*(?=\.)
The key changes are the lazy .*? in place of your greedy .*, so the match stops at the first place where the rest of the pattern can succeed instead of running out to the last four-digit year anywhere in the string (here, "Bradlow and Bent 2002"), and the (?=\.) lookahead, which requires a literal period ahead (your (?=.) accepted any character).
See the regex demo.
Details
\be\.g\. - a whole word e.g. (\b is a word boundary)
, - a comma
.*? - any 0+ chars other than line break chars (add (?s) at the pattern start to make it match line breaks, too)
[0-9]{4} - four digits
[^\w.]* - 0+ chars other than word chars and dot
(?=\.) - (a positive lookahead matching a location where) a . must be immediately to the right of the current location.
R demo:
rx <- "\\be\\.g\\.,.*?[0-9]{4}[^\\w.]*(?=\\.)"
gsub(x = mystring1, pattern = rx, replacement = "", perl = TRUE)
## => [1] "Other work has shown that, in addition to language-general features such as a decreased speaking rate and an expanded pitch range, clear speech production involves the enhancement of the acoustic-phonetic distance between phonologically contrastive categories ."
gsub(x = mystring2, pattern = rx, replacement = "", perl = TRUE)
## => [1] "Other work has shown that, in addition to language-general features such as a decreased speaking rate and an expanded pitch range, clear speech production involves the enhancement of the acoustic-phonetic distance between phonologically contrastive categories . Therefore, reduced sensitivity to any or all of the language-specific acoustic-phonetic dimensions of contrast and clear speech enhancement would yield a diminished clear speech benefit for non-native listeners. This may appear somewhat surprising given that clear speech production was elicited in our studies by instructing the talkers to speak clearly for the sake of listeners with either a hearing impairment or from a different native language background. However, as discussed further in Bradlow and Bent 2002, the limits of clear speech as a means of enhancing non-native speech perception likely reflect the “mistuning” that characterizes spoken language communication between native and non-native speakers."
Related
I have:
Stringa=" This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995).Given the major differences between big data and research-collected data, it is surprising how little discussion has arisen about how using big data should change the practice of theory-informed IS research. Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014). Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008). Perhaps “scientists no longer have to make educated guesses, construct hypotheses and models, test them in data-based experiments andexamples. Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientificconclusions without further experimentation” (Prensky, 2009). "
Desired output:
[1]This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995).
[2]Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014)
[3] Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008)
[4]Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientific conclusions without further experimentation” (Prensky, 2009)
I use: unlist(str_extract_all(string = Stringa, pattern = "\\. [A-Za-z][^()]+ \\("))
But it doesn't work.
I don't want to extract 'Given the major differences between big data and research-collected data, it is surprising how little discussion has arisen about how using big data should change the practice of theory-informed IS research.' or 'Perhaps "scientists no longer have to make educated guesses, construct hypotheses and models, test them in data-based experiments andexamples.'
If there are no abbreviations in the text, you may use
regmatches(Stringa, gregexpr("[^.?!\\s][^.!?]*?\\([^()]*\\)", Stringa, perl=TRUE))
[[1]]
[1] "This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995)"
[2] "Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014)"
[3] "Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008)"
[4] "Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientificconclusions without further experimentation” (Prensky, 2009)"
See the regex demo and the R demo.
Details
[^.?!\s] - any char but ., ?, ! and whitespace
[^.!?]*? - any 0+ chars other than ., ?, !, as few as possible
\([^()]*\) - a (, then 0+ chars other than ( and ), and then a ).
We can handle this using gregexpr and regmatches, using the following regex pattern:
.*?\([^)]+\).*?(?=\w|$)
This will capture any content up to the first parenthesis, followed by a (...) term. The script below will capture all such matches in the source text.
m <- gregexpr(".*?\\([^)]+\\).*?(?=\\w|$)", Stringa, perl=TRUE)
regmatches(Stringa, m)
[[1]]
[1] "This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995)."
[2] "Given the major differences between big data and research-collected data, it is surprising how little discussion has arisen about how using big data should change the practice of theory-informed IS research. Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014). "
[3] "Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008). "
[4] "Perhaps “scientists no longer have to make educated guesses, construct hypotheses and models, test them in data-based experiments andexamples. Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientificconclusions without further experimentation”(Prensky, 2009). "
I want to extract lines that start with an "i" in a text file.
I tried this
i <- grep("^i.", text,value = TRUE)
All the lines, including those starting with "ii" and "iii", are extracted. How can I solve this problem?
data
"i. provides substantial identification and comment upon significant aspects of texts \\"
"ii. provides substantial identification and comment upon the creator\\'92s choices \\"
"iii. sufficiently justifies opinions and ideas with examples and explanations; uses accurate terminology
We need to escape the . so that it is read as a literal . character. Otherwise, it matches any character.
grep('^i\\.', text, value=TRUE)
#[1] "i. provides substantial identification and comment upon significant aspects of texts \\"
data
text <- c("i. provides substantial identification and comment upon significant aspects of texts \\",
"ii. provides substantial identification and comment upon the creator\\'92s choices \\",
"iii. sufficiently justifies opinions and ideas with examples and explanations; uses accurate terminology ")
I need some help with the specific arguments of the agrep function in R.
Within max.distance, the cost, all, insertions, deletions, and substitutions components each take a "maximum number/fraction" value, given either as an integer or as a fraction.
I've read the documentation on it, but I still cannot figure out some specifics:
What is the difference between "cost=1" and "all=1"?
How is a decimal interpreted, such as "cost=0.1", "inserts=0.9", "all=0.25", etc.?
I understand the basics of the Levenshtein Distance, but how is it applied in terms of the cost or all arguments?
Sorry if this is fairly basic, but like I said, the documentation I have read on it is slightly confusing.
Thanks in advance
Not 100% certain, but here is my understanding:
In max.distance, cost and all are interchangeable if you don't specify a costs argument (this is the next argument); if you do, then cost will limit based on the weighted (as per costs) cost of the insertions/deletions/substitutions you specified, whereas all will limit on the raw count of those operations.
A fraction represents what fraction of the number of characters in your pattern argument you want to allow as insertions/deletions/substitutions (i.e. 0.1 on a 10-character pattern would allow 1 change). If you specify costs, then it is that fraction of the number of characters in pattern * max(costs), though presumably fractions in max.distance{insertions/deletions/substitutions} will be the number of characters * the corresponding costs value.
I agree that the documentation is not as complete as it could be. I discovered the above by building simple test examples and messing around with them. You should be able to do the same and confirm for yourself, particularly the last part (i.e. whether costs affects the fraction measure of max.distance{insertions/deletions/substitutions}), which I haven't tested.
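To make that concrete, here is a small sketch of my own (the strings and numbers are invented, and exact matches may differ slightly between R versions) contrasting the max.distance components:
x <- c("kitten", "mitten", "sitting", "smitten")
# all = 1: allow at most one insertion/deletion/substitution in total (raw count).
agrep("kitten", x, max.distance = list(all = 1), value = TRUE)
# A plain fraction is relative to the pattern length (times the maximal cost)
# and, per the documentation, is rounded up: 0.25 * 6 = 1.5, so up to 2 changes here.
agrep("kitten", x, max.distance = 0.25, value = TRUE)
# With an explicit costs argument, 'cost' limits the weighted sum of the
# operations rather than their raw count.
agrep("kitten", x,
      max.distance = list(cost = 2),
      costs = list(insertions = 2, deletions = 2, substitutions = 1),
      value = TRUE)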
How does the 68000 internally represent instructions?
I've read that there are different types of instructions: single effective operation word format instructions, and brief and full extension word format instructions. The single effective operation word seems to encode the instruction, with its lower 6 bits giving the addressing mode and register. Does this addressing mode and register tell you whether a brief or full extension word follows, which in turn represents the operands of the instruction? Do you know a better manual than the 68000 programmer's reference manual?
Thanks in advance
The actual internal representation is a combination of "microcode" and "nanocode". The 68000 has 544 17-bit microcode words which dispatch to 366 68-bit nanocode words.
While this may not be what you wanted to know, this link may provide some insights:
http://www.easy68k.com/paulrsm/doc/dpbm68k1.htm
Right, on the 68000 the indexed modes use the brief extension word. In "Address Register Indirect with Index (8-Bit Displacement) Mode" (d8, An, Xn), the BEW is filled with D/A (whether Xn is a data or an address register), Xn (the register number), W/L (whether to treat the contents of Xn as 16 or 32 bits), the scale set to 0 (see the note), and the 8-bit displacement.
On the other hand, other modes, like the 16-bit displacement mode "Address Register Indirect with Displacement" (d16,An), use an extension that is just one word holding the displacement.
The note: in the brief extension word the 68000 doesn't support the 2 scale bits, so they are set to 0; scaling via the scale bits of the BEW, and full extension words, are only supported on the 68020/68040 and later CPUs. http://etd.dtu.dk/thesis/264182/bac10_19.pdf
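To make that field layout concrete, here is a small illustrative sketch of my own (not taken from the manual or the links above) that unpacks the fields of a brief extension word:
# Illustrative only: unpack the fields of a 68000 brief extension word.
decode_bew <- function(word) {
  list(
    da    = bitwAnd(bitwShiftR(word, 15L), 1L),  # 0 = Xn is a data register, 1 = an address register
    reg   = bitwAnd(bitwShiftR(word, 12L), 7L),  # index register number (Xn)
    wl    = bitwAnd(bitwShiftR(word, 11L), 1L),  # 0 = use Xn as 16 bits, 1 = as 32 bits
    scale = bitwAnd(bitwShiftR(word, 9L), 3L),   # always 0 on the plain 68000
    disp  = bitwAnd(word, 0xFFL)                 # 8-bit displacement (sign-extend in real use)
  )
}
decode_bew(0x8005L)  # A0 used as a 16-bit index, displacement 5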
There's a problem I've encountered a lot (in the broad fields of data analysis and AI). However, I can't name it, probably because I don't have a formal CS background. Please bear with me; I'll give two examples:
Imagine natural language parsing:
The flower eats the cow.
You have a program that takes each word, and determines its type and the relations between them. There are two ways to interpret this sentence:
1) flower (substantive) -- eats (verb) --> cow (object)
using the usual SVO word order, or
2) cow (substantive) -- eats (verb) --> flower (object)
using a more poetic word order. The program would rule out other possibilities, e.g. "flower" as a verb, since it follows "the". It would then rank the remaining possibilities: 1) has a more natural word order than 2), so it gets more points. But including the world knowledge that flowers can't eat cows, 2) still wins. So it might return both hypotheses, and give 1) a score of 30, and 2) a score of 70.
Then, it remembers both hypotheses and continues parsing the text, branching off. One branch assumes 1), one 2). If a branch reaches a contradiction, or a ranking of ~0, it is discarded. In the end it presents ranked hypotheses again, but for the whole text.
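In code, a minimal sketch of that branch-and-score idea (the structure and the numbers are entirely hypothetical, just to make the description concrete) could look like this:
# Hypothetical sketch of keeping scored hypotheses, forking them on the next
# piece of input, and pruning branches whose score collapses.
hypotheses <- list(
  list(reading = "flower eats cow (usual SVO reading)", score = 30),
  list(reading = "cow eats flower (poetic reading)",    score = 70)
)
extend <- function(hyps, next_readings) {
  forked <- list()
  for (h in hyps) {
    for (r in names(next_readings)) {
      forked[[length(forked) + 1]] <- list(
        reading = paste(h$reading, r, sep = " + "),
        score   = h$score * next_readings[[r]]   # combine like unnormalised probabilities
      )
    }
  }
  # Bail out early: drop branches whose score has collapsed to (near) zero.
  Filter(function(h) h$score > 1e-6, forked)
}
hypotheses <- extend(hypotheses, c(literal = 0.2, poetic = 0.8))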
For a different example, imagine optical character recognition:
** **
** ** *****
** *******
******* **
* ** **
** **
I could look at the strokes and say, sure this is an "H". After identifying the H, I notice there are smudges around it, and give it a slightly poorer score.
Alternatively, I could run my smudge recognition first, and notice that the horizontal line looks like an artifact. After removal, I recognize that this is ll or Il, and give it some ranking.
After processing the whole image, it can be Hlumination, lllumination or Illumination. Using a dictionary and the total ranking, I decide that it's the last one.
The general problem is always some kind of parsing / understanding. Examples:
Natural languages or ambiguous languages
OCR
Path finding
Dealing with ambiguous or incomplete user input - which interpretations make sense, and which is the most plausible?
It's recursive.
It can bail out early (when a branch / interpretation doesn't make sense, or will certainly end up with a score of 0). So it's probably some kind of backtracking.
It keeps all options in mind in light of ambiguities.
It's based on simple rules at the bottom, e.g. can_eat(cow, flower) = true.
It keeps a plausibility ranking of interpretations.
It's recursive on a meta level: It can fork / branch off into different 'worlds' where it assumes different hypotheses when dealing with the next part of data.
It'll forward the individual rankings, probably using Bayesian probability, to dependent hypotheses.
In practice, there will be methods to train this thing, determine ranking coefficients, and there will be cutoffs if the tree becomes too big.
I have no clue what this is called. One might guess 'decision tree' or 'recursive descent', but I know those terms mean different things.
I know Prolog can solve simple cases of this, like genealogies and finding out who is whose uncle. But you have to give all the data in code, and it doesn't seem convenient or powerful enough for my real-life cases.
I'd like to know: what is this problem called, and are there common strategies for dealing with it? Is there good literature on the topic? Are there libraries, ideally for C(++) or Python, where you can just define a bunch of rules and it works out all the rankings and hypotheses?
I don't think there is one answer that fits all the bullet points you have. But I hope my links will lead you closer to an answer or might give you a different question.
I think the closest answer is a Bayesian network, since you have probabilities affecting each other. As I understand it, it is also related to conditional probability and fuzzy logic.
You also describe a bit of genetic programming, as well as artificial neural networks.
I can name drop some more topics which might be related:
http://en.wikipedia.org/wiki/Rule-based_programming
http://en.wikipedia.org/wiki/Expert_system
http://en.wikipedia.org/wiki/Knowledge_engineering
http://en.wikipedia.org/wiki/Fuzzy_system
http://en.wikipedia.org/wiki/Bayesian_inference