text extraction with R

I want to extract the lines that start with "i." from a text file. I tried this:
i <- grep("^i.", text, value = TRUE)
but all the lines, including those starting with "ii" and "iii", are extracted. How can I solve this?
data
"i. provides substantial identification and comment upon significant aspects of texts \\"
"ii. provides substantial identification and comment upon the creator\\'92s choices \\"
"iii. sufficiently justifies opinions and ideas with examples and explanations; uses accurate terminology

We need to escape the . so it is read as a literal dot; otherwise it matches any character (a quick comparison is shown after the data below).
grep('^i\\.', text, value=TRUE)
#[1] "i. provides substantial identification and comment upon significant aspects of texts \\"
data
text <- c("i. provides substantial identification and comment upon significant aspects of texts \\",
"ii. provides substantial identification and comment upon the creator\\'92s choices \\",
"iii. sufficiently justifies opinions and ideas with examples and explanations; uses accurate terminology ")

How to extract sentences between point and brackets with R?

I have:
Stringa=" This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995).Given the major differences between big data and research-collected data, it is surprising how little discussion has arisen about how using big data should change the practice of theory-informed IS research. Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014). Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008). Perhaps “scientists no longer have to make educated guesses, construct hypotheses and models, test them in data-based experiments andexamples. Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientificconclusions without further experimentation” (Prensky, 2009). "
Desired output:
[1]This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995).
[2]Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014)
[3] Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008)
[4]Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientific conclusions without further experimentation” (Prensky, 2009)
I use: unlist(str_extract_all(string = Stringa, pattern = "\\. [A-Za-z][^()]+ \\(")), but it doesn't work.
I don't want to extract 'Given the major differences between big data and research-collected data, it is surprising how little discussion has arisen about how using big data should change the practice of theory-informed IS research.' or 'Perhaps “scientists no longer have to make educated guesses, construct hypotheses and models, test them in data-based experiments andexamples.'
If there are no abbreviations in the text, you may use
regmatches(Stringa, gregexpr("[^.?!\\s][^.!?]*?\\([^()]*\\)", Stringa, perl=TRUE))
[[1]]
[1] "This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995)"
[2] "Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014)"
[3] "Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008)"
[4] "Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientificconclusions without further experimentation” (Prensky, 2009)"
Details
[^.?!\s] - any char but ., ?, ! and whitespace
[^.!?]*? - any 0+ chars other than ., ?, ! as few as possible
\([^()]*\) - a (, 0+ chars other than ( and ) and then a ).
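Since the question already uses stringr, the same pattern should also work with str_extract_all; this is an untested equivalent of the call above:
library(stringr)
# Same pattern, via stringr; str_extract_all returns a list, so unlist() flattens it
unlist(str_extract_all(Stringa, "[^.?!\\s][^.!?]*?\\([^()]*\\)"))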
We can handle this using gregexpr and regmatches, with the following regex pattern:
.*?\([^)]+\).*?(?=\w|$)
This will capture any content up to the first parenthesis, followed by a (...) term. The script below will capture all such matches in the source text.
x <- Stringa  # the input string from the question
m <- gregexpr(".*?\\([^)]+\\).*?(?=\\w|$)", x, perl=TRUE)
regmatches(x, m)
[[1]]
[1] "This is different from primary data created specifically by researchers to reflect concepts that are higher-order and more abstract(Lee,1991;Walsham,1995)."
[2] "Given the major differences between big data and research-collected data, it is surprising how little discussion has arisen about how using big data should change the practice of theory-informed IS research. Some scholars have noted that the very nature of inquiry is likely to change, given that large data sets, advanced algorithms, and powerful computing capabilities can initiate and refine questions without human intervention (Agarwal & Dhar, 2014). "
[3] "Other commentators argue that the scientific method is likely to become obsolete, as with the “availability of huge amounts of data, along with the statistical tools to crunch these numbers … science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” (Anderson, 2008). "
[4] "Perhaps “scientists no longer have to make educated guesses, construct hypotheses and models, test them in data-based experiments andexamples. Instead, they canmine thecomplete setof data forpatterns that reveal effects, producing scientificconclusions without further experimentation”(Prensky, 2009). "

How to remove citation parts that are missing parentheses

DATA
mystring1 <- "Other work has shown that, in addition to language-general features such as a decreased speaking rate and an expanded pitch range, clear speech production involves the enhancement of the acoustic-phonetic distance between phonologically contrastive categories 􏰃e.g., Ferguson and Kewley-Port, 2002; Krause and Braida, 2004, Picheny et al, 1986; Smiljanic and Bradlow, 2005, 2007􏰀."
mystring2 <- "Other work has shown that, in addition to language-general features such as a decreased speaking rate and an expanded pitch range, clear speech production involves the enhancement of the acoustic-phonetic distance between phonologically contrastive categories 􏰃e.g., Ferguson and Kewley-Port, 2002; Krause and Braida, 2004, Picheny et al, 1986; Smiljanic and Bradlow, 2005, 2007􏰀. Therefore, reduced sensitivity to any or all of the language-specific acoustic-phonetic dimensions of contrast and clear speech enhancement would yield a diminished clear speech benefit for non-native listeners. This may appear somewhat surprising given that clear speech production was elicited in our studies by instructing the talkers to speak clearly for the sake of listeners with either a hearing impairment or from a different native language background. However, as discussed further in Bradlow and Bent 􏰃2002􏰀, the limits of clear speech as a means of enhancing non-native speech perception likely reflect the “mistuning” that characterizes spoken language communication between native and non-native speakers."
I'd like some help with a regular expression. I have some text data. Basically, I want to remove citation parts that appear between the last word in a sentence and its period. However, the parentheses are somehow missing; mystring1 is an example of that. In this example, I want to remove e.g., Ferguson and Kewley-Port, 2002; Krause and Braida, 2004, Picheny et al, 1986; Smiljanic and Bradlow, 2005, 2007􏰀. But this sentence is just one of the sentences in a paragraph. mystring2 contains three more sentences following mystring1. My goal is to remove the citation part from mystring2, but I have not been successful; the pattern removes more text than I want. How can I revise the regex pattern? Thank you for your help in advance.
# This works for mystring1.
gsub(x = mystring1, pattern = "e\\.g\\.,.*[0-9]{4}(?=.)", replacement = "", perl = T)
[1] "Other work has shown that, in addition to language-general features such as a
decreased speaking rate and an expanded pitch range, clear speech production involves
the enhancement of the acoustic-phonetic distance between phonologically contrastive
categories 􏰃􏰀."
# But this pattern does not work for mystring2; gsub() removes more text than I want.
gsub(x = mystring2, pattern = "e\\.g\\.,.*[0-9]{4}(?=.)", replacement = "", perl = T)
[1] "Other work has shown that, in addition to language-general features such as a decreased
speaking rate and an expanded pitch range, clear speech production involves the
enhancement of the acoustic-phonetic distance between phonologically contrastive
categories 􏰃􏰀, the limits of clear speech ... (I trimmed texts here) speakers."
I suggest using
\be\.g\.,.*?[0-9]{4}[^\w.]*(?=\.)
Details
\be\.g\. - a whole word e.g. (\b is a word boundary)
, - a comma
.*? - any 0+ chars other than line break chars (add (?s) at the pattern start to make it match line breaks, too)
[0-9]{4} - four digits
[^\w.]* - 0+ chars other than word chars and dot
(?=\.) - (a positive lookahead matching a location where) a . must be immediately to the right of the current location.
R demo:
rx <- "\\be\\.g\\.,.*?[0-9]{4}[^\\w.]*(?=\\.)"
gsub(x = mystring1, pattern = rx, replacement = "", perl = TRUE)
## => [1] "Other work has shown that, in addition to language-general features such as a decreased speaking rate and an expanded pitch range, clear speech production involves the enhancement of the acoustic-phonetic distance between phonologically contrastive categories 􏰃."
gsub(x = mystring2, pattern = rx, replacement = "", perl = TRUE)
## => [1] "Other work has shown that, in addition to language-general features such as a decreased speaking rate and an expanded pitch range, clear speech production involves the enhancement of the acoustic-phonetic distance between phonologically contrastive categories 􏰃. Therefore, reduced sensitivity to any or all of the language-specific acoustic-phonetic dimensions of contrast and clear speech enhancement would yield a diminished clear speech benefit for non-native listeners. This may appear somewhat surprising given that clear speech production was elicited in our studies by instructing the talkers to speak clearly for the sake of listeners with either a hearing impairment or from a different native language background. However, as discussed further in Bradlow and Bent 􏰃2002􏰀, the limits of clear speech as a means of enhancing non-native speech perception likely reflect the “mistuning” that characterizes spoken language communication between native and non-native speakers."

why don't we use simple encryption?

What I mean by this is: if I create a Lua program that randomly assigns numbers and letters to a three-digit code, wouldn't this code then be almost unbreakable (say, if someone who wasn't supposed to get it did) unless you have the program? Sorry if this was already asked; could someone direct me to it?
Simple encryption is not used because it is not sufficiently secure. We use whatever level of encryption is necessary to meet the required security level and successfully defend against attackers.
Attackers range from a curious friend to nation states (think the NSA, GCHQ, or the KGB).
"Schneier's Law": Anyone, from the most clueless amateur to the best cryptographer, can create an algorithm that he himself can't break.
What you are describing is called a substitution cipher.
These are broken by using frequency analysis. Because each letter and number is always assigned to the same code, letters in the input will lead to the corresponding codes appearing in the output with the same frequency. The cryptanalyst will study the kinds of data he expects to be input to the cipher, and find common symbols and patterns, then match those with frequent patterns in the output. For example, if the input is English text, the cryptanalyst knows that the most frequent codes represent E, T, A, O, I, … and the most common sequences are "THE", "BE", "OF", "AND", etc.
Codes in the output should occur with uniform probability. When there is a bias in the output, it can be exploited to break the code. One way to avoid this, in basic terms, is to use a different "code book" for each letter in the input. So "E" doesn't always translate to the same code; it would depend on the position of the "E" in the message.
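As a toy illustration of that bias in R (the ciphertext here is just the pangram "the quick brown fox jumps over the lazy dog" run through an Atbash-style letter substitution; real analysis needs far more text):
ciphertext <- "gsv jfrxp yildm ulc qfnkh levi gsv ozab wlt"
symbols <- strsplit(gsub("[^a-z]", "", ciphertext), "")[[1]]
sort(table(symbols), decreasing = TRUE)
# Match the most frequent cipher symbols against the most frequent letters of the
# expected plaintext language (E, T, A, O, I, ... for English); a pangram is far too
# short for this to be reliable, which is exactly why attackers want lots of ciphertext.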

semantic matching strings - using word2vec or s-match?

I have this problem of matching two strings for 'more general', 'less general', 'same meaning', 'opposite meaning' etc.
The strings can be from any domain. Assume that the strings can be from people's emails.
To give an example,
String 1 = "movies"
String 2 = "Inception"
Here I should know that Inception is less general than movies (sort of is-a relationship)
String 1 = "Inception"
String 2 = "Christopher Nolan"
Here I should know that Inception is less general than Christopher Nolan
String 1 = "service tax"
String 2 = "service tax 2015"
At a glance it appears to me that S-match will do the job. But I am not sure if S-match can be made to work on knowledge bases other than WordNet or GeoWordNet (as mentioned on their page).
If I use word2vec or dl4j, I guess it can give me the similarity scores. But does it also support telling a string is more general or less general than the other?
But I do see that word2vec can be trained on a training set or a large corpus such as Wikipedia.
Can someone shed light on the way forward?
Current machine learning methods for modelling words, such as word2vec and dl4j, are based on the distributional hypothesis. They train models of words and phrases based on their context. There are no ontological aspects in these word models. At best, a well-trained model based on these tools can tell you whether two words can appear in similar contexts; that is how their similarity measure works.
The Mikolov papers (a, b and c), which suggest that these models can learn "Linguistic Regularity", don't include any ontological test analysis; they only suggest that these models are capable of predicting "similarity between members of the word pairs". This kind of prediction doesn't help your task. These models are even incapable of distinguishing similarity from relatedness (see, e.g., the SimLex test set).
I would say that you need an ontological database to solve your problem. More specifically, for String 1 and String 2 in your examples:
String 1 = "a"
String 2 = "b"
You are trying to check entailment relations in sentences:
(1) "c is b"
(2) "c is a"
(3) "c is related to a".
Where:
(1) entails (2)
or
(1) entails (3)
In your first two examples, you can probably use semantic knowledge bases to solve the problem. But your third example will probably need syntactic parsing before the difference between the two phrases can be understood. For example, consider these phrases:
"men"
"all men"
"tall men"
"men in black"
"men in general"
Solving this in general needs logical understanding. However, based on the economy of language, you can assume that adding more words to a phrase usually makes it less general: longer phrases tend to be less general than shorter ones. This does not give you a precise tool, but it can help to analyse phrases that do not contain special words such as all, general or every.
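A very rough sketch of that length heuristic (purely illustrative; compare_generality is a made-up helper):
# Crude heuristic: treat the longer phrase as the less general one.
# It deliberately ignores special words such as "all", "general" or "every".
compare_generality <- function(a, b) {
  na <- length(strsplit(trimws(a), "\\s+")[[1]])
  nb <- length(strsplit(trimws(b), "\\s+")[[1]])
  if (na < nb) "String 1 is more general" else if (na > nb) "String 1 is less general" else "no difference by length"
}
compare_generality("service tax", "service tax 2015")   # "String 1 is more general"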

Is there any decryption algorithm that uses a dictionary to decrypt an encrypted document?

Well, I have been working on an assignment, and it states:
A program has to be developed, and coded in C language, to decipher a document written
in Italian that is encoded using a secret key. The secret key is obtained as random
permutation of all the uppercase letters, lowercase letters, numbers and blank space. As
an example, let us consider the following two strings:
Plain: “ABCDEFGHIJKLMNOPQRSTUVXWYZabcdefghijklmnopqrstuvwxyz0123456789 ”
Code: “BZJ9y0KePWopxYkQlRjhzsaNTFAtM7H6S24fC5mcIgXbnLOq8Uid 3EDv1ruVGw”
The secret key modifies only letters, numbers, and spaces of the original document, while
the remaining characters are left unchanged. The document is stored in a text file whose
length is unknown.
The program has to read the document, find the secret key (which by definition is
unknown; the above table is just an example and it is not the key used for preparing the
sample files available on the web course) using a suitable decoding algorithm, and write
the decoded document to a new text file.
I know that I have to load an English dictionary into the program, but I don't know why that has been asked (it may not be in that statement, but I do have to do it). My question is: since I can write that program using a simple encryption/decryption algorithm, what's the use of loading the English dictionary into the program? So, is there any decryption algorithm that uses a dictionary to decrypt an encrypted document? Or can somebody tell me what approach or algorithm I should use to solve this problem?
An early reply (and an authentic one) will be highly appreciated. Thank you, guys.
This is a simple substitution cipher. It can be broken using frequency analysis. The Wikipedia articles explain both concepts thoroughly. What you need to do is:
Find the statistical frequency of characters in Italian texts. If you can't find this published anywhere, you can build it yourself by analyzing a large corpus of Italian texts.
Analyze the frequency of characters in the cipher text, and match it to the statistical data.
The first Wikipedia article links to a set of tools that implement all of the above. You just need to use them and possibly adapt them to your use case.
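A compact sketch of those steps (shown in R just to illustrate the idea, although the assignment itself is in C; the file name and the reference ordering are placeholders, not real data):
ciphertext   <- paste(readLines("document.txt"), collapse = " ")   # hypothetical input file
cipher_chars <- strsplit(gsub("[^A-Za-z0-9 ]", "", ciphertext), "")[[1]]
cipher_rank  <- names(sort(table(cipher_chars), decreasing = TRUE))
italian_rank <- c(" ", "e", "a", "o", "i", "n", "l", "r", "t", "s")   # placeholder ordering, not real statistics
# Pair the most frequent cipher symbols with the most frequent reference characters
n <- min(length(cipher_rank), length(italian_rank))
data.frame(cipher = cipher_rank[seq_len(n)], guess = italian_rank[seq_len(n)])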
Your cipher is a substitution cipher. That is, it substitutes one letter for another.
Consider the ciphertext
"yjr,1drv2ry1od1q1..."
We can use a dictionary to find the plaintext.
Find the punctuation: since a space always follows a comma, you can find the substitution rule for spaces,
which gives you:
"yjr, drv2ry od q..."
Notice the word lengths. Since there are only two one-letter words in the English language, the q is probably i or a. "yjr" is probably "why", "the", "how", etc.
We try "why", with the result
"why, dyv2yw od q..."
There are no English words with two y's that end in w.
So we try "the" and get
"the, dev2et od q..."
We conclude that "the" is a likely answer.
Now we search our dictionary for words that look like ?e??et.
Rinse and repeat.
That is, find some set of words which fit into the lengths available and do not break each other's substitution rules.
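A small sketch of that dictionary lookup, in R for brevity (the word-list file name is a placeholder):
# Search a word list for words matching the partially decoded pattern ?e??et,
# where "?" stands for a still-unknown letter.
dictionary <- readLines("words.txt")   # placeholder path to a word list
grep("^.e..et$", dictionary, value = TRUE, ignore.case = TRUE)
# Candidates such as "secret" are then checked against the other substitution rules.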
Personally I just do the frequency analysis suggested above.
Frequency analysis, as both other respondents said, is the way to go, and you can use digrams and trigrams to make it much stronger. Just grab tons of Italian text from the web and churn ahead! It's really pretty simple programming.

Resources