Using variable input for str_extract_all in R - r

I am pretty green when it comes to R and coding in general. I've been working on a CS project recently for a linguistics course through which I'm finding the words that surround various natural landscape words in The Lord of the Rings. For instance, I'm interested in finding the descriptive words used around words like "stream", "mountain", etc.
Anyhow, to extract all of these words from the text, I've been working off of this post. When running this command by itself, it works:
stringr::str_extract_all(text, "([^\\s]+\\s){4}stream(\\s[^\\s]+){6}")
where "stream" is the specific word I'm going after. The numbers before and after specify how many words before and after I want to extract along with it.
However, I'm interested in combining this (and some other things) into a single function, where all you need to plug in the text you want to search, and the word you want to get context for. However, as far as I've tinkered, I can't get anything other than a specific word to work in the above code. Would there be a way to, in the context of writing a function in R, include the above code, but with a variable input, for instance
stringr::str_extract_all(text, "([^\\s]+\\s){4}WORD(\\s[^\\s]+){6}")
where WORD is whatever you specify in the overall function:
function(text,WORD)
I apologize for the generally apparent newb-ness of this post. I am very new to all of this but would greatly appreciate any help you could offer.

This is what you are looking for, if I understood you correctly,
my_fun <- function(input_text, word) {
stringr::str_extract_all(
string = input_text,
pattern = paste("([^\\s]+\\s){4}", word, "(\\s[^\\s]+){6}", sep = "")
)
}
May the light of Eärendil ever shine upon you!

Related

How to detect grammatical elements in a corpus

I'm working with a big corpus in RStudio and the next phase of our research includes the detection of some grammatical elements and its frequency in the texts. We want to detect the frequency of occurrence of things like the use of abstract nouns or deontic modalities which include the auxiliary verbs ‘must’, ‘have to’, ‘may’, ‘can’, ‘should’, ‘ought to ’, etc. I would like to capture its possible conjugation, i.e., not only 'she have to' but 'she had to'; not only 'he can' but 'he could'. I guess it could be done using some simple RegEx such as
She ha(ve|d) to
He c(an|ould)
...right? The problem is 1) I'm not sure whether this can be done (I guess it can be) and 2) which library should I use to do that.
I have thought I could make a dictionary and run it to the whole corpus but 1) and 2) are still here.

Is there any way to replicate Rstudio's integrated search function as a code?

For context, I asked a question earlier today about matching company names with various variations against a big list with a lot of different company names by using the "stringdist" function from the stringdist package, in order to identify the companies in that big list. This is the question I asked.
Unfortunately, I have not been able to make any improvements to my code, which is why I'm starting to look away from stringdist and try something completely different.
I use Rstudio, and I've noticed that the internal search function in that program is much more effective:
As you can see by the picture, simply searching for the company name in the top right corner gives me the output that I'm looking for, such as the longer name "AMMINEX EMISSIONS..." and "AMMINEX AS".
However, in my previous attempt with the stringdist function (see the link to my previous question) I would get results like "LAMINEX" which are not at all relevant, but would appear before the more useful matches:
So it seems like using the algorithm that Rstudio uses is much more efficient in my case, however I'm not quite sure if it's possible to replicate this algorithm in code form, instead of having to manually search for each company.
Assuming I have a data frame that looks like this:
Company_list <- data.frame(Companies=c('AMMINEX', 'Microsoft', 'Apple'))
What would be a way for me to search for all 3 companies at the same time and get the same type of results in a data frame, like Rstudio does in the first image?
From your description of which results are good or bad, it sounds like you like exact matches of a substring rather than things that are close on those distance measures. In that case you can imitate Rstudio's search function with grepl
library(tidyverse)
demo.df <- data.frame(name = paste(rep(c("abc","jkl","xyz"), each=4), sample(1:100,4*3)), limbs=1:4*3)
demo.df%>%filter(grepl('abc|xyz',name))
where the pipe in the grepl pattern string means 'or', letting you search for multiple companies at the same time. So, to search for the names from the example data frame this string would be paste0(Company_list$Companies,collapse="|") Is this what you're after?

TM, Quanteda, text2vec. Get strings on the left of term in wordlist according to regex pattern

I would like to analyse a big folder of texts for the presence of names, addressess and telephone numbers in several languages.
These will usually be preceded with a word "Address", "telephone number", "name", "company", "hospital", "deliverer". I will have a dictionary of these words.
I am wondering if text mining tools would be perfect for the job.
I would like to create a Corpus for all these documents and then find texts that meet specific (i am thinking about regex criteria) on the right or down of the given dictionary entry.
Is there such a syntax in data mining packages in R, ie. to get the strings on the right or down of the wordlist entry, the strings that meet a specific pattern?
If not, would be more suitable tool in R to do the job?
Two options with quanteda come to mind:
Use kwic with your list of target patterns, with a window big enough to capture the size after the term that you want. This will return a data.frame that you can use the keyword and post columns from for your analysis. You can also construct a corpus directly from this object (corpus(mykwic)) and then focus on the new post docvar which will contain the text you want.
Use corpus_segment where you use the target word list to create a "tag" type, and anything following this tag, until the next tag, will be reshaped into a new document. This works well but is a bit trickier to configure, since you will need to get the regex right for the tag.

Easy export and table formatting of R dataframe to Word? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Improve this question
Is there an easy way to automatize the conversion of a R dataframe to a pretty Word table in APA format for publishing manuscripts? I'm currently doing this by saving the table in a csv, opening that in excel, copying the excel table to Word, and formatting it there, but I'm hoping there would be a way to automatize the formatting in R, so that when I convert it to Word, it would already be in APA format, because Word sucks in automatization.
Basically, I want to continue writing the manuscript itself in Word, while doing my analyses in R. Then gather all the results in R to a table (with manually modifiable formatting) by a script and convert it to whatever format I could then simply copy-paste to Word (so that the formatting actually holds). When I need to modify the table, I would make the changes in R and then just run the script again without the need to do any changes in Word.
I don't want to learn LaTeX, because everyone in my field uses Word with features like track changes, and I use Zotero add-in for citations, so it's simpler to just keep the writing separate from the analyses. Also, I am a psychologist, not a coder, so learning a lot of new technologies just for this is probably not worth the effort for me. Typically with new technologies come new technical problems, and I am aiming to make my workflow quicker, but not at the cost of unpredictability (which may make it slower exactly at the moment when I cannot afford it).
I found a R+knitr+rmarkdown+pander+pandoc solution "with as little overhead as possible", but it seems to be quite heavy still because I don't know any of those technologies apart from R. And I'm not eager to start learning all that, as it seems to be aimed for doing the writing and all in R to the very end, while I want to separate my writing and my code - I never need code in my writing, only the result tables. In addition, based on the examples, it seems to fetch the values directly from R code (e.g., from summary() to create a descriptive table), while I need to be able to tinker with my table manually before converting it, for instance, writing the title and notes (like a specific note to one cell and explaining it in the bottom). I also found R2wd, but it seems to be an older attempt for the same "whole workflow in R" problem as the solution above. SWord does not seem to be working anymore.
Any suggestions?
(Just to let you know, I am the author of the packages I recommend you...)
You can use package ReporteRs to output your table to Word. See here a tutorial (not mine):
http://www.sthda.com/english/wiki/create-and-format-word-documents-using-r-software-and-reporters-package
Objects FlexTable let you format and arrange tables easily with some standard R code. For example, to set the 2nd column in bold, the code looks like:
myFlexTable[, 2] = textBold()
There are (old) examples here:
http://davidgohel.github.io/ReporteRs/flextable_examples.html
These objects can be added to a Word report using the function addFlexTable. The word report can be generated with function writeDoc.
If you are working in RStudio, you can print the object and it will be rendered in the html viewer so you can export it in Word when you are satisfied with its content.
You can even add real Word footnotes (see the link below)
http://davidgohel.github.io/ReporteRs/pot_objects.html#pot_footnotes
If you need more tabular output, I recommend you also the rtable package that handles xtable objects (and other things I have to develop to satisfy my colleagues or customers) - a quick demo can be seen here:
http://davidgohel.github.io/tabular/
Hope it helps...
I have had the same need, and I have ended up using the package htmlTable, which is quite 'cost-efficient'. This creates a HTML table (in RStudio it is created in the "Viewer" windows in the bottom right which I just mark using the mouse copy-paste to Word. (Start marking form the bottom of the table and drag the mouse upwards, that way you are sure to include the start of the HTML code.) Word handles these tables quite nicely. The syntax of is quite simple involving just the function htmlTable(), but is still able to make somewhat more complex tables, such as grouped rows and primary and secondary column headers (i.e. column headers spanning more than one column). Check out the examples in the vignette.
One note of caution: htmlTable will not work will factor variables, i.e., they will come out as integer numbers (according to factor levels). So read the data using stringsAsFactors = FALSE or convert them using as.character().
Including trailing zeroes can be done using the txtRound function. Example:
mini_table <- data.frame(Name="A", x=runif(20), stringsAsFactors = FALSE)
txt <- txtRound(mini_table, 2)
It is not completely straightforward to assign formatting soch as bold or italics, but it can be done by wrapping the table contents in HTML code. If you for instance want to make an entire column bold, it can be done like this (please note the use of single and double quotation marks inside paste0):
library(plyr)
mini_table <- data.frame(Name="A", x=runif(20), stringsAsFactors = FALSE)
txt <- txtRound(mini_table, 2)
txt$x <- aaply(txt$x, 1, function(x)
paste0("<span style='font-weight:bold'>", x, "</span")
)
htmlTable(txt)
Of course, that would be easier to to in Word. However, it is more interesting to add formatting to numbers according to some criteria. For instance, if we want to emphasize all values of x that are less than 0.2 by applying bold font, we can modify the code above as follows:
library(plyr)
mini_table <- data.frame(Name="A", x=runif(20), stringsAsFactors = FALSE)
txt <- txtRound(mini_table, 2)
txt$x <- aaply(txt$x, 1, function(x)
if (as.numeric(x)<0.2) {
paste0("<span style='font-weight:bold'>", x, "</span>")
} else {
paste0("<span>", x, "</span>")
})
htmlTable(txt)
If you want even fancier emphasis, you can for instance replace the bold font by red background color by using span style='background-color: red' in the code above. All these changes carry over to Word, at least on my computer (Windows 7).
The short answer is "not really." I've never had much luck getting well formatted tables into MS Word. The best approach I can offer you requires using Rmarkdown to render your tables into an HTML file. You can copy and paste you results from the HTML file to MS Word, but I make no guarantees about how well the formatting will follow.
To format your tables, you can try something like the xtable package, or the pixiedust package. But again, no guarantees that the formatting will transfer.

Find HEX patterns and number of occurrences

I'd like to find patterns and sort them by number of occurrences on an HEX file I have.
I am not looking for some specific pattern, just to make some statistics of the occurrences happening there and sort them.
DB0DDAEEDAF7DAF5DB1FDB1DDB20DB1BDAFCDAFBDB1FDB18DB23DB06DB21DB15DB25DB1DDB2EDB36DB43DB59DB32DB28DB2ADB46DB6FDB32DB44DB40DB50DB87DBB0DBA1DBABDBA0DB9ADBA6DBACDBA0DB96DB95DBB7DBCFDBCBDBD6DB9CDBB5DB9DDB9FDBA3DB88DB89DB93DBA5DB9CDBC1DBC1DBC6DBC3DBC9DBB3DBB8DBB6DBC8DBA8DBB6DBA2DB98DBA9DBB9DBDBDBD5DBD9DBC3DB9BDBA2DB84DB83DB7DDB6BDB58DB4EDB42DB16DB0DDB01DB02DAFCDAE9DAE5DAD9DAE2DAB7DA9BDAA6DA9EDAAADAC9DACADAC4DA92DA90DA84DA89DA93DAA9DA8CDA7FDA62DA53DA6EDA
That's an excerpt of the HEX file, and as an example I'd like to get:
XX occurrences of BDBDBD
XX occurrences of B93D
Is there a way to mine the file to generate that output?
Sure. Use a sliding window to create the counts (The link is for Perl, but it seems general enough to understand the algorithm). Your patterns are named N-grams. You will have to limit the maximal pattern, though.
This is a pretty classic CS problem. The code in general is non-trivial to implement as it will require at least one full parse of the sequence, and depending on your efficiency and memory/processor constraints might require several. See here.
You will need to partition your input string in some way to ensure that you get a good subsequence across it.
If there is a specific problem we might be able to help more, but the general strategy is in the Wikipedia article above.
You can use Regular Expressions to make a pattern to search for.
The regex needed would be very simple. Just use the exact phrase you're searching for. Then there should be a regular expression function in the language you're using (you didn't specify) that can count the number of matches.
Use that to create a simple counter.

Resources