I have:
Stringa=" This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995).Given the major differences between big data and research-collected data, it is surprising how little discussion has arisen about how using big data should change the practice of theory-informed IS research. Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014). Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008). Perhaps “scientists no longer have to make educated guesses, construct hypotheses and models, test them in data-based experiments andexamples. Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientificconclusions without further experimentation” (Prensky, 2009). "
Desired output:
[1]This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995).
[2]Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014)
[3] Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008)
[4]Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientific conclusions without further experimentation” (Prensky, 2009)
I use:
unlist(str_extract_all(string = Stringa, pattern = "\\. [A-Za-z][^()]+ \\("))
But it doesn't work.
I don’t want to extract ‘Given the major differences between big data and research-collected data, it is surprising how little discussion has arisen about how using big data should change the practice of theory-informed IS research.’ or ‘Perhaps “scientists no longer have to make educated guesses, construct hypotheses and models, test them in data-based experiments andexamples.’
If there are no abbreviations in the text, you may use
regmatches(Stringa, gregexpr("[^.?!\\s][^.!?]*?\\([^()]*\\)", Stringa, perl=TRUE))
[[1]]
[1] "This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995)"
[2] "Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014)"
[3] "Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008)"
[4] "Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientificconclusions without further experimentation” (Prensky, 2009)"
See the regex demo and the R demo.
Details
[^.?!\\s] - any char but ., ?, ! and whitespace
[^.!?]*? - any 0+ chars other than ., ?, ! as few as possible
\([^()]*\) - a (, 0+ chars other than ( and ) and then a ).
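For readers working outside R, the same pattern carries over to other regex engines. The sketch below uses Python's re module on a shortened sample text; the sample string and the function name extract_cited_sentences are assumptions made for illustration and are not part of the original question. As with the R version, it only holds up if the text has no abbreviations containing a period before the citation.

```python
import re

# Same pattern as above: one character that is neither whitespace nor a
# sentence terminator, then as few non-terminator characters as possible,
# then a single parenthesised group.
CITED_SENTENCE = re.compile(r"[^.?!\s][^.!?]*?\([^()]*\)")


def extract_cited_sentences(text):
    """Return each sentence fragment that ends in a parenthesised citation."""
    return CITED_SENTENCE.findall(text)


# Hypothetical, shortened stand-in for the long string in the question.
sample = (
    "Primary data differ (Lee, 1991). This sentence has no citation. "
    "Some scholars disagree (Agarwal & Dhar, 2014). Perhaps not. "
    "Instead, they mine the data (Prensky, 2009)."
)

for sentence in extract_cited_sentences(sample):
    print(sentence)
```

Sentences without a citation are skipped because the lazy part of the pattern cannot cross a ., ? or ! to reach the next parenthesis.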
We can handle this using gregexpr and regmatches, using the following regex pattern:
.*?\([^)]+\).*?(?=\w|$)
This will capture any content up to the first parenthesis, followed by a (...) term. The script below will capture all such matches in the source text.
m <- gregexpr(".*?\\([^)]+\\).*?(?=\\w|$)", x, perl=TRUE)
regmatches(x, m)
[[1]]
[1] "This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995)."
[2] "Given the major differences between big data and research-collected data, it is surprising how little discussion has arisen about how using big data should change the practice of theory-informed IS research. Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014). "
[3] "Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008). "
[4] "Perhaps “scientists no longer have to make educated guesses, construct hypotheses and models, test them in data-based experiments andexamples. Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientificconclusions without further experimentation”(Prensky, 2009). "
I want to extract the lines that start with an "i" from a text file.
I tried this
i <- grep("^i.", text,value = TRUE)
All the lines are extracted, including those starting with ii and iii. How can I solve this problem?
data
"i. provides substantial identification and comment upon significant aspects of texts \\"
"ii. provides substantial identification and comment upon the creator\\'92s choices \\"
"iii. sufficiently justifies opinions and ideas with examples and explanations; uses accurate terminology
We need to escape the . so that it is read as a literal dot. Otherwise, it matches any character.
grep('^i\\.', text, value=TRUE)
#[1] "i. provides substantial identification and comment upon significant aspects of texts \\"
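The same distinction holds in other regex engines. As a rough illustration in Python (the shortened sample lines are assumptions for the sketch, not the original file contents), compare the unescaped and the escaped dot:

```python
import re

# Hypothetical, shortened versions of the three lines from the question.
lines = [
    "i. provides substantial identification",
    "ii. provides comment upon the creator's choices",
    "iii. sufficiently justifies opinions",
]

# Unescaped dot: "i" followed by ANY character, so "ii." and "iii." match too.
loose = [s for s in lines if re.match(r"i.", s)]

# Escaped dot: "i" followed by a literal ".", so only the first line matches.
strict = [s for s in lines if re.match(r"i\.", s)]

print(len(loose), len(strict))
```

re.match only matches at the start of the string, so it plays the role of the ^ anchor in the R call.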
data
text <- c("i. provides substantial identification and comment upon significant aspects of texts \\",
"ii. provides substantial identification and comment upon the creator\\'92s choices \\",
"iii. sufficiently justifies opinions and ideas with examples and explanations; uses accurate terminology ")
I have this problem of matching two strings for 'more general', 'less general', 'same meaning', 'opposite meaning' etc.
The strings can be from any domain. Assume that the strings can be from people's emails.
To give an example,
String 1 = "movies"
String 2 = "Inception"
Here I should know that Inception is less general than movies (sort of is-a relationship)
String 1 = "Inception"
String 2 = "Christopher Nolan"
Here I should know that Inception is less general than Christopher Nolan
String 1 = "service tax"
String 2 = "service tax 2015"
At a glance it appears to me that S-match will do the job. But I am not sure if S-match can be made to work on knowledge bases other than WordNet or GeoWordNet (as mentioned in their page).
If I use word2vec or dl4j, I guess it can give me the similarity scores. But does it also support telling a string is more general or less general than the other?
But I do see word2vec can be based on a training set or large corpus like wikipedia etc.
Can someone shed some light on the way forward?
The current use of machine learning methods such as word2vec and dl4j for modelling words is based on the distributional hypothesis. They train models of words and phrases based on their context. There are no ontological aspects in these word models. At best, a well-trained model built with these tools can say whether two words can appear in similar contexts. That is how their similarity measure works.
The Mikolov papers (a, b and c), which suggest that these models can learn "Linguistic Regularity", don't include any ontological test analysis; they only suggest that these models are capable of predicting "similarity between members of the word pairs". This kind of prediction doesn't help your task. These models are even incapable of distinguishing similarity from relatedness (see, e.g., the SimLex test set page).
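To make the limitation concrete: the similarity these models report is typically a cosine between two vectors, and a cosine is symmetric, so on its own it cannot say which of two strings is the more general one. A minimal sketch with made-up three-dimensional vectors (the numbers are invented for illustration and do not come from any trained model):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, NOT real embeddings.
toy_vectors = {"movies": [0.9, 0.1, 0.3], "Inception": [0.8, 0.2, 0.4]}

forward = cosine_similarity(toy_vectors["movies"], toy_vectors["Inception"])
backward = cosine_similarity(toy_vectors["Inception"], toy_vectors["movies"])

# Swapping the arguments gives the same score: no direction is encoded.
print(math.isclose(forward, backward))
```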
I would say that you need an ontological database to solve your problem. More specifically, for String 1 and String 2 in your examples:
String 1 = "a"
String 2 = "b"
You are trying to check entailment relations in sentences:
(1) "c is b"
(2) "c is a"
(3) "c is related to a".
Where:
(1) entails (2)
or
(1) entails (3)
In your first two examples, you can probably use semantic knowledge bases to solve the problem. But your third example will probably need syntactic parsing before the difference between the two phrases can be understood. For example, take these phrases:
"men"
"all men"
"tall men"
"men in black"
"men in general"
Solving your problem needs a logical understanding of such phrases. However, you can reason from the economy of language: adding more words to a phrase usually makes it less general, so longer phrases tend to be less general than shorter ones. This doesn't give you a precise tool to solve the problem, but it can help to analyse phrases that don't contain special words such as all, general or every.
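As a rough sketch of that heuristic, one could compare the word sets of two phrases and call the phrase whose words are a strict subset of the other's the more general one. The function name compare_generality and the subset refinement (rather than plain length) are assumptions made here for illustration:

```python
def compare_generality(phrase_a, phrase_b):
    """Guess which phrase is more general from word overlap alone."""
    tokens_a = set(phrase_a.lower().split())
    tokens_b = set(phrase_b.lower().split())
    if tokens_a == tokens_b:
        return "same words"
    if tokens_a < tokens_b:  # every word of A also occurs in B
        return "first is more general"
    if tokens_b < tokens_a:
        return "second is more general"
    return "undecided"  # no containment, so the heuristic has nothing to say


print(compare_generality("service tax", "service tax 2015"))
print(compare_generality("men", "tall men"))
print(compare_generality("movies", "Inception"))
```

It handles "service tax" versus "service tax 2015", but it returns "undecided" for "movies" versus "Inception", and it would wrongly rank "men" above "men in general", which is why phrases with words such as all, general or every need real logical analysis.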
There's a problem I've encountered a lot (in the broad fields of data analysis or AI). However, I can't name it, probably because I don't have a formal CS background. Please bear with me, I'll give two examples:
Imagine natural language parsing:
The flower eats the cow.
You have a program that takes each word, and determines its type and the relations between them. There are two ways to interpret this sentence:
1) flower (substantive) -- eats (verb) --> cow (object)
using the usual SVO word order, or
2) cow (substantive) -- eats (verb) --> flower (object)
using a more poetic word order. The program would rule out other possibilities, e.g. "flower" as a verb, since it follows "the". It would then rank the remaining possibilities: 1) has a more natural word order than 2), so it gets more points. But including the world knowledge that flowers can't eat cows, 2) still wins. So it might return both hypotheses, and give 1) a score of 30, and 2) a score of 70.
Then, it remembers both hypotheses and continues parsing the text, branching off. One branch assumes 1), one 2). If a branch reaches a contradiction, or a ranking of ~0, it is discarded. In the end it presents ranked hypotheses again, but for the whole text.
For a different example, imagine optical character recognition:
** **
** ** *****
** *******
******* **
* ** **
** **
I could look at the strokes and say, sure this is an "H". After identifying the H, I notice there are smudges around it, and give it a slightly poorer score.
Alternatively, I could run my smudge recognition first, and notice that the horizontal line looks like an artifact. After removal, I recognize that this is ll or Il, and give it some ranking.
After processing the whole image, it can be Hlumination, lllumination or Illumination. Using a dictionary and the total ranking, I decide that it's the last one.
The general problem is always some kind of parsing / understanding. Examples:
Natural languages or ambiguous languages
OCR
Path finding
Dealing with ambiguous or incomplete user input - which interpretations make sense, which is the most plausible?
It's recursive.
It can bail out early (when a branch / interpretation doesn't make sense, or will certainly end up with a score of 0). So it's probably some kind of backtracking.
It keeps all options in mind in light of ambiguities.
It's based on simple rules at the bottom, e.g. can_eat(cow, flower) = true.
It keeps a plausibility ranking of interpretations.
It's recursive on a meta level: It can fork / branch off into different 'worlds' where it assumes different hypotheses when dealing with the next part of data.
It'll forward the individual rankings, probably using Bayesian probability, to dependent hypotheses.
In practice, there will be methods to train this thing, determine ranking coefficients, and there will be cutoffs if the tree becomes too big.
I have no clue what this is called. One might guess 'decision tree' or 'recursive descent', but I know those terms mean different things.
I know Prolog can solve simple cases of this, like genealogies and finding out who is whose uncle. But you have to give all the data in code, and it doesn't seem convenient or powerful enough to do this for my real-life cases.
I'd like to know: what is this problem called, and are there common strategies for dealing with it? Is there good literature on the topic? Are there libraries, ideally for C(++) or Python, where you can just define a bunch of rules, and it works out all the rankings and hypotheses?
I don't think there is one answer that fits all the bullet points you have. But I hope my links will lead you closer to an answer or might give you a different question.
I think the closest answer is a Bayesian network, since, as I understand it, you have probabilities affecting each other. It is also related to conditional probability and fuzzy logic.
You also describe a bit of genetic programming as well as artificial neural networks.
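To give a feel for the probabilistic side, here is a minimal sketch of a single Bayesian update applied to the flower/cow sentence from the question. The prior and likelihood values are invented for illustration (they are not taken from any real parser); the only point is how a word-order preference and world knowledge combine into a ranking:

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a finite set of competing hypotheses."""
    unnormalised = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalised.values())
    return {h: p / total for h, p in unnormalised.items()}


# Assumed prior: the usual subject-verb-object order is far more common.
priors = {"flower eats cow": 0.9, "cow eats flower": 0.1}

# Assumed likelihood of each reading given the world knowledge about who can eat whom.
likelihoods = {"flower eats cow": 0.05, "cow eats flower": 0.95}

ranking = posterior(priors, likelihoods)
for hypothesis, probability in sorted(ranking.items(), key=lambda kv: -kv[1]):
    print(f"{hypothesis}: {probability:.2f}")
```

With these assumed numbers the poetic reading ends up at roughly 0.68 and the literal one at roughly 0.32, close to the 70/30 split used in the question.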
I can name drop some more topics which might be related:
http://en.wikipedia.org/wiki/Rule-based_programming
http://en.wikipedia.org/wiki/Expert_system
http://en.wikipedia.org/wiki/Knowledge_engineering
http://en.wikipedia.org/wiki/Fuzzy_system
http://en.wikipedia.org/wiki/Bayesian_inference
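The branching, pruning and re-ranking behaviour from the question can also be sketched with a beam search, a technique not named above and offered here only as one possible fit. The sketch replays the OCR example; all probabilities, the dictionary penalty of 0.1 and the names beam_search and rescore are assumptions made for illustration:

```python
import heapq


def beam_search(steps, beam_width=2):
    """Keep only the best `beam_width` partial readings after each step."""
    beam = [(1.0, "")]  # (score, text read so far)
    for alternatives in steps:
        expanded = [
            (score * prob, text + symbol)
            for score, text in beam
            for symbol, prob in alternatives
            if score * prob > 0.0  # bail out early on impossible branches
        ]
        beam = heapq.nlargest(beam_width, expanded)
    return beam


# Each step lists competing readings of one image region with assumed scores.
steps = [
    [("H", 0.5), ("Il", 0.3), ("ll", 0.2)],
    [("lumination", 1.0)],
]
dictionary = {"Illumination"}


def rescore(score, text):
    """Penalise readings that are not dictionary words (assumed factor 0.1)."""
    return score if text in dictionary else score * 0.1


candidates = beam_search(steps, beam_width=3)
best_score, best_text = max((rescore(s, t), t) for s, t in candidates)
print(best_text)
```

The local winner "H" loses once the whole word is checked against the dictionary, which mirrors the point that a ranking has to be kept per hypothesis and revised as more data is processed.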
While reading about lambda calculus, I came across the term "lambda definability". Can someone please explain what it is? I couldn't find any good resources on it.
Thanks
More generally, there is a line of research seeking to characterize "lambda definability" over a broad class of languages. "Lambda definability" itself is typically relative to a semantics of a language given in terms of sets. For a type T in our language, write |T| for its interpretation as a set. Now, take an element of |T| -- call it e. We want to know if there is a term in our language -- call it x : T (x of type T), such that |x| is e. If there is such a term, then we say that e is lambda-definable.
Now, in our perfect world, when we interpret a language into sets, we would like to say that the sets associated with each type are precisely those that contain the lambda-definable elements of that type and only the lambda-definable elements (completeness). It would also be nice, perhaps to say that we can provide an algorithm to determine if a claimed element of a set has an associated lambda term (decidability).
Now, often we don't just model into sets, but into other funny mathematical constructions. And we don't model just from the lambda calculus, but from other related systems such as Plotkin's PCF or the like. But the property under study is typically still called "lambda-definability".
After decades of research there are still many open problems and questions in this regard -- while certain lower-order terms have been shown to have decidable lambda-definability (the classic results involve terms up to second-order), many terms do not yield so easily. This paper ("The Undecidability of lambda-Definability" by Ralph Loader) gives an important such undecidability result and characterizes some consequences: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.6860
See the Church-Turing thesis, where lambda-definable functions (from Church) are those that give us "effectively computable" functions. Turing showed that programs implementable on a Turing machine are equivalent to lambda-definable functions.
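As a small illustration of a lambda-definable function in Church's sense, addition on natural numbers can be written using nothing but lambda terms. The sketch below encodes Church numerals with Python lambdas; the helper to_int is an assumption added only to make the result readable and is not itself part of the lambda calculus:

```python
# Church numerals: the number n is "apply f n times".
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))

# Addition as a pure lambda term: apply f m times on top of n applications.
add = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))


def to_int(numeral):
    """Convert a Church numeral to a Python int by counting applications."""
    return numeral(lambda k: k + 1)(0)


two = succ(succ(zero))
three = succ(two)
print(to_int(add(two)(three)))
```

This shows the untyped, function-on-numbers sense of the term; the semantic sense, where one asks whether an element of a set is denoted by some typed term, is a separate and harder question.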
I'm maintaining code for a mathematical algorithm that came from a book, with references in the comments. Is it better to have variable names that are descriptive of what the variables represent, or should the variables match what is in the book?
For a simple example, I may see this code, which reflects the variable in the book.
A_c = v*v/r
I could rewrite it as
centripetal_acceleration = velocity*velocity/radius
The advantage of the latter is that anyone looking at the code could understand it. However, the advantage of the former is that it is easier to compare the code with what is in the book. I may do this in order to double check the implementation of the algorithms, or I may want to add additional calculations.
Perhaps I am over-thinking this, and should simply use comments to describe what the variables are. I tend to favor self-documenting code however (use descriptive variable names instead of adding comments to describe what they are), but maybe this is a case where comments would be very helpful.
I know this question can be subjective, but I wondered if anyone had any guiding principles in order to make a decision, or had links to guidelines for coding math algorithms.
I would prefer to use the more descriptive variable names. You can't guarantee everyone that is going to look at the code has access to "the book". You may leave and take your copy, it may go out of print, etc. In my opinion it's better to be descriptive.
We use a lot of mathematical reference books in our work, and we reference them in comments, but we rarely use the same mathematically abbreviated variable names.
A common practice is to summarise all your variables, indexes and descriptions in a comment header before starting the code proper, e.g.
// A_c = Centripetal Acceleration
// v = Velocity
// r = Radius
A_c = (v*v)/r
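One way to get both benefits is to keep descriptive names at the function boundary and the book's symbols inside the body, with the mapping spelled out in comments. A minimal sketch in Python; the function name and layout are assumptions for illustration, and the book reference is left as a placeholder to be filled in:

```python
def centripetal_acceleration(velocity, radius):
    """Centripetal acceleration, A_c = v^2 / r.

    Reference: (cite the book, chapter and equation number here)
    """
    # Short names mirror the book's symbols so the formula can be checked
    # against the reference at a glance.
    v = velocity  # v   = velocity
    r = radius    # r   = radius
    A_c = v * v / r  # A_c = centripetal acceleration
    return A_c


print(centripetal_acceleration(10.0, 5.0))
```

Callers see only the descriptive names, while anyone verifying the implementation against the book sees the familiar formula.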
I write a lot of mathematical software. IF I can insert in the comments a very specific reference to a book or a paper or (best) web site that explains the algorithm and defines the variable names, then I will use the SHORT names like a = v * v / r because it makes the formulas easier to read and write and verify visually.
IF not, then I will write very verbose code with lots of comments and long descriptive variable names. Essentially, my code becomes a paper that describes the algorithm (anyone remember Knuth's "Literate Programming" efforts, years ago? Though the technology for it never took off, I emulate the spirit of that effort). I use a LOT of ascii art in my comments, with box-and-arrow diagrams and other descriptive graphics. I use Jave.de -- the Java Ascii Vmumble Editor.
I will sometimes write my math with short, angry little variable names, easier to read and write for ME because I know the math, then use REFACTOR to replace the names with longer, more descriptive ones at the end, but only for code that is much more informal.
I think it depends almost entirely upon the audience for whom you're writing -- and don't ever mistake the compiler for the audience either. If your code is likely to be maintained by more or less "general purpose" programmers who may not/probably won't know much about physics so they won't recognize what v and r mean, then it's probably better to expand them to be recognizable for non-physicists. If they're going to be physicists (or, for another example, game programmers) for whom the textbook abbreviations are clear and obvious, then use the abbreviations. If you don't know/can't guess which, it's probably safer to err on the side of the names being longer and more descriptive.
I vote for the "book" version. 'v' and 'r' etc. are pretty well understood as abbreviations for velocity and radius, and they are more compact.
How far would you take it?
Most (non-greek :-)) keyboards don't provide easy access to Δ, but it's valid as part of an identifier in some languages (e.g. C#):
int Δv;
int Δx;
Anyone coming afterwards and maintaining the code may curse you every day. Similarly for a lot of other symbols used in maths. So if you're not going to use those actual symbols (and I'd encourage you not to), I'd argue you ought to translate the rest, where it doesn't make for code that's too verbose.
In addition, what if you need to combine algorithms, and those algorithms have conflicting usage of variables?
A compromise could be to code and debug as contained in the book, and then perform a global search and replace for all of your variables towards the end of your development, so that it is easier to read. If you do this I would change the names of the variables slightly so that it is easier to change them later.
e.g. A_c# = v#*v#/r#