How to get the input word as the output word in a neural network - recurrent-neural-network

I am trying to get the input word back as the output word with an LSTM (that is, to copy a word). I have seen some snippets for this, and in those examples I observed that the characters of a word need to be converted into integers. However, those examples were written to predict the next character.
An example of what I am trying to do:
input_word = 'algebra'
pass the input_word through some network
output_word = 'algebra'
This is the link I tried: https://machinelearningmastery.com/develop-character-based-neural-language-model-keras/
Any ideas or links about this problem would be helpful.
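A minimal sketch of one approach, using the keras R package (the same idea carries over to the Python code in the linked tutorial): map each character to an integer, one-hot encode the sequence, and train a sequence-to-sequence LSTM whose training target is its own input. The vocabulary, unit count, and epoch count below are illustrative guesses, not values taken from the tutorial.

library(keras)  # assumes the keras R package with a TensorFlow backend

chars <- letters                                # toy vocabulary: 'a'..'z'
word  <- "algebra"
ints  <- match(strsplit(word, "")[[1]], chars)  # characters as integers: 1 12 7 5 2 18 1

# One-hot encode, one timestep per character: shape (1, 7, 26)
x <- to_categorical(ints - 1, num_classes = length(chars))
x <- array(x, dim = c(1, nrow(x), ncol(x)))

# Sequence-to-sequence "copy" model: predict the input back at every timestep
model <- keras_model_sequential() %>%
  layer_lstm(units = 64, input_shape = dim(x)[-1], return_sequences = TRUE) %>%
  time_distributed(layer_dense(units = length(chars), activation = "softmax"))

model %>% compile(loss = "categorical_crossentropy", optimizer = "adam")
model %>% fit(x, x, epochs = 300, verbose = 0)  # the target equals the input

# Decode the prediction back into characters
pred <- model %>% predict(x)
paste(chars[apply(pred[1, , ], 1, which.max)], collapse = "")  # ideally "algebra"

For a single word the network simply memorises it; to copy unseen words you would train the same architecture on many words, padded to a common length.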

Related

Extract text after blank line and specific word in R

I have a dataset with multiple news articles. Each news article has a classification at the bottom of the text. I want to extract the subject tags:
Classification
Language: ENGLISH
Publication-Type: Newswire
Subject: MANAGERS & SUPERVISORS (90%); POLLS & SURVEYS (90%); HUMAN
MACHINE INTERACTION (78%)
Company: ABC-Company
The number of tags can differ substantially across the dataset. Also, not all news articles have the information on publication type or language.
So far I've tried:
y <- str_extract(x$Text, "Subject: .*")
This worked well until I found that some news reports contain the "subject" part in their body; R then extracts whatever text comes after that.
I am now looking for a way to adjust my code to account for the blank line that always precedes "Subject: ".
Simply adding a blank line did not work (it works in Python, which is why I tried it).
In the best case, I would adjust the above code to extract the information after the word "Subject" only if that word is preceded by a blank line and only if it comes after the word "Classification". This would make my code more robust.
I believe I found a way: I added \n to my string.
y <- str_extract(x$Text, "\nSubject: .*")
Worked for me.
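For the more robust "best case" described above (match "Subject:" only when a blank line precedes it and only after "Classification"), a sketch along these lines should work; the sample text below is made up to mimic the structure, and str_match is vectorised, so it applies to x$Text just as well:

library(stringr)

text <- paste0(
  "Body text that happens to mention Subject: something else.\n\n",
  "Classification\n",
  "Language: ENGLISH\n\n",
  "Subject: MANAGERS & SUPERVISORS (90%); POLLS & SURVEYS (90%)\n"
)

# (?s) lets .*? cross newlines: require "Classification", then a blank line,
# then capture only the "Subject: " line itself
str_match(text, "(?s)Classification.*?\\n[ \\t]*\\n(Subject: [^\\n]*)")[, 2]
#> "Subject: MANAGERS & SUPERVISORS (90%); POLLS & SURVEYS (90%)"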

Using variable input for str_extract_all in R

I am pretty green when it comes to R and coding in general. I've been working on a CS project for a linguistics course in which I'm finding the words that surround various natural-landscape words in The Lord of the Rings. For instance, I'm interested in the descriptive words used around words like "stream", "mountain", etc.
Anyhow, to extract all of these words from the text, I've been working off of this post. When run by itself, this command works:
stringr::str_extract_all(text, "([^\\s]+\\s){4}stream(\\s[^\\s]+){6}")
where "stream" is the specific word I'm going after. The numbers before and after specify how many words before and after I want to extract along with it.
However, I'm interested in combining this (and some other things) into a single function where all you need to plug in is the text you want to search and the word you want context for. As far as I've tinkered, though, I can't get anything other than a specific, hard-coded word to work in the above code. Would there be a way, in the context of writing a function in R, to include the above code but with a variable input, for instance
stringr::str_extract_all(text, "([^\\s]+\\s){4}WORD(\\s[^\\s]+){6}")
where WORD is whatever you specify in the overall function:
function(text, WORD)
I apologize for the generally apparent newb-ness of this post. I am very new to all of this but would greatly appreciate any help you could offer.
This is what you are looking for, if I understood you correctly:
my_fun <- function(input_text, word) {
  # Build the pattern around the supplied word, then extract as before
  stringr::str_extract_all(
    string  = input_text,
    pattern = paste0("([^\\s]+\\s){4}", word, "(\\s[^\\s]+){6}")
  )
}
May the light of Eärendil ever shine upon you!
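As a quick usage sketch (the sentence is made up; note the pattern only matches when there are at least four words before and six words after the target):

text <- paste(
  "and they came at last to a swift stream that ran down",
  "from the hills into the valley far below them"
)
my_fun(text, "stream")
#> [[1]]
#> [1] "last to a swift stream that ran down from the hills"

If the search word could contain regex metacharacters, escape it before pasting it into the pattern.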

Extract only the English text from a pdf of Canadian legislation in R

I'm trying to extract data from a Canadian Act (in this case, the Food and Drugs Act) for a project and import it into R. I want to break it up into two parts: first the table of contents, and second the information in the act itself. But I do not want the French part (I'm sorry). I have tried tabulizer's extract_area(), but I don't want to have to select the area by hand 90 times (I'm going to do this for multiple pieces of legislation).
Obviously I don't have a minimal reproducible example coded out... But the pdf is downloadable here: https://laws-lois.justice.gc.ca/eng/acts/F-27/
Option 2 is to write something to pull it out via XML, but I'm a bit less used to working with XML files. Unless it's incredibly annoying to do with either pdftools or tabulizer, I'd prefer an answer using one of those libraries (mostly for learning purposes).
I've seen some similar questions on Stack Overflow, but they're all confusingly written or designed for tables, which this is not. I am not a quant/data-science researcher by training, so an explanation would be super helpful (but not required).
Here's an option that reads in the pdf text and detects the language. You're probably going to have to do a lot of text cleanup after reading in the pdf; I assume you don't care about retaining formatting.
library(pdftools)
library(cld3)  # language tool to detect French and English

a = pdf_text('F-27.pdf')

# Split the text to get sentence chunks, mostly
b = sapply(a, strsplit, '\r\n')

# Do a bunch of other text cleanup; here's an example using the third list
# element. You can expand this to cover all of b with a loop or a list
# function like sapply. Splitting on two spaces should hopefully retain
# most sentence-like fragments; you can get more sophisticated:
d = strsplit(b[[3]], '  ')[[1]]

x = sapply(d, detect_language)

# Keep only English (detect_language returns NA for very short fragments)
x[!is.na(x) & x == 'en']
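To expand that to every page, as the comments suggest, one sketch is the following (b is the page-wise list created above):

keep_english <- function(page_lines) {
  frags <- trimws(unlist(strsplit(page_lines, '  ')))
  frags <- frags[frags != '']
  langs <- sapply(frags, cld3::detect_language)
  frags[!is.na(langs) & langs == 'en']
}

english_text <- lapply(b, keep_english)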

Can I count and list how many times words were used in an Excel document?

I am working on analyzing some text data from a ticketing system. I am pulling long text fields out of the tickets and need to analyze which words are being used and which are used the most. I need it to list all of the words.
The file is in Excel format. Using tm, I have made some edits to the data and removed stop words and other words that aren't really important to what I am looking for, and I have already made this into a corpus.
The following code gives me roughly what I need, but it does not actually give me all of the words. I know this is going to be a long list, but that is fine.
dtm <- DocumentTermMatrix(hardwareCN.Clean)
dtmDataFrame1 <- as.data.frame(inspect(dtm))
colSums(dtmDataFrame1)
This gives me only about 10 words, but I know there are many, many more than that. I also need to be able to export the result to share it.
Thanks
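A sketch of the usual fix: inspect() only prints a small preview of the matrix, which is why you see about 10 words. Converting the DocumentTermMatrix with as.matrix() gives you every term (hardwareCN.Clean is the corpus from the question; the csv filename is just an example):

library(tm)

dtm <- DocumentTermMatrix(hardwareCN.Clean)
m   <- as.matrix(dtm)                               # the full matrix, not a preview

word_counts <- sort(colSums(m), decreasing = TRUE)  # every word, most used first
head(word_counts)

# Export the full list to share
write.csv(data.frame(word = names(word_counts), count = word_counts),
          "word_counts.csv", row.names = FALSE)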

Input data for capture-mark-recapture (CMR) in package: marked

I am trying to get abundance estimates of rodents for my survey area using capture-mark-recapture. I am using the package "marked" but keep running into a problem where my capture histories are changed to numbers, e.g. 0010 becomes 10.
I have tried csv files, txt files, and changing the numbers to text in Excel, but can't seem to get it right. When I am able to get the data into R as a capture history as opposed to a number, I get "Incorrect ch values in data:10;FMU".
I think the error lies in the input data, and I'm wondering if anyone has an example of their input data for something similar?
Any suggestions would be greatly appreciated.
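A sketch of the usual cause and fix: when a capture-history column is read as numeric, leading zeros are lost, so "0010" becomes 10. Forcing the column to character on import usually solves it. The filename below is an assumption; marked does expect the history in a column named ch. (The "FMU" in your error message suggests a non-history value may also have ended up in that column, so it is worth checking the import.)

library(marked)

# Force ch to be read as character so "0010" keeps its leading zeros
df <- read.csv("captures.csv", colClasses = c(ch = "character"))
str(df$ch)  # should show chr "0010" "1100" ... with leading zeros intact

mod <- crm(df)  # a basic CJS model as a smoke test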
