What does the R code below do?

# Read State of the Union file
speech <- readLines("stateoftheunion1790-2012.txt")
head(speech)
What does this code do after it reads the file? I was told it will give a list where each entry is the text between consecutive ***'s, but what does that mean?
x <- grep("^\\*{3}", speech)
list.speeches <- list()
for (i in 1:length(x)) {
  if (i == 1) {
    list.speeches[[i]] <- paste(speech[1:x[1]], collapse = " ")
  } else {
    list.speeches[[i]] <- paste(speech[x[i - 1]:x[i]], collapse = " ")
  }
}

It looks like you're new to SO; welcome to the community! As @Allan Cameron pointed out, whenever you ask questions, especially if you want great answers quickly, it's best to make your question reproducible. This includes sample data, like the output from dput() or reprex::reprex(). Check it out: making R reproducible questions.
I've detailed each part of the code with coding comments. Feel free to ask questions if you've got any.
speech <- readLines("https://raw.githubusercontent.com/jdwilson4/Intro-Data-Science/master/Data/stateoftheunion1790-2012.txt")
head(speech) # print the first 6 lines captured in the object speech
# [1] "The Project Gutenberg EBook of Complete State of the Union Addresses,"
# [2] "from 1790 to the Present"
# [3] ""
# [4] "Character set encoding: UTF8"
# [5] ""
# [6] "The addresses are separated by three asterisks"
x <- grep("^\\*{3}", speech)
# find the indices of the lines in speech that begin with three asterisks (***)
list.speeches <- list() # create a list to store the results
for (i in 1:length(x)) { # for each index that matched three asterisks
  if (i == 1) { # if it's the first set of asterisks ***
    # capture all vector elements up to the first set of 3 asterisks:
    # the file information and who gave each of the speeches
    list.speeches[[i]] <- paste(speech[1:x[1]], collapse = " ")
  } else {
    # capture the lines between each pair of consecutive *** indices,
    # placing each complete speech in a different list position
    list.speeches[[i]] <- paste(speech[x[i - 1]:x[i]], collapse = " ")
  }
}
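For comparison, the same split can be done without an explicit loop. This is a sketch on a tiny invented vector (not the real Gutenberg file): cumsum() over the separator matches gives every line a group number, and split() collects the groups.

```r
# Toy stand-in for the file: two "speeches" separated by *** lines
lines <- c("header", "***", "speech one line 1", "speech one line 2",
           "***", "speech two line 1")

# Each line gets a running group id that increases at every *** line
groups <- cumsum(grepl("^\\*{3}", lines))

# Collapse each group of lines into a single string
chunks <- vapply(split(lines, groups), paste, character(1), collapse = " ")
```

Note the grouping differs slightly from the loop above: here a *** line starts a new chunk rather than ending the previous one.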

Related

R: find a specific string next to another string with for loop

I have the text of a novel in a single vector; it has been split by words (novel.vector.words). I am looking for all instances of the string "blood of". However, since the vector is split by words, each word is its own string, and I don't know how to search for adjacent strings in a vector.
I have a basic understanding of what for loops do, and following some instructions from a textbook, I can use this for loop to target all positions of "blood" and the context around it to create a tab-delimited KWIC display (key words in context).
node.positions <- grep("blood", novel.vector.words)
output.conc <- "D:/School/U Alberta/Classes/Winter 2019/LING 603/dracula_conc.txt"
cat("LEFT CONTEXT\tNODE\tRIGHT CONTEXT\n", file=output.conc) # tab-delimited header
#This establishes the range of how many words we can see in our KWIC display
context <- 10 # specify a window of ten words before and after the match
for (i in 1:length(node.positions)) { # access each match...
  # access the current match
  node <- novel.vector.words[node.positions[i]]
  # access the left context of the current match
  left.context <- novel.vector.words[(node.positions[i] - context):(node.positions[i] - 1)]
  # access the right context of the current match
  right.context <- novel.vector.words[(node.positions[i] + 1):(node.positions[i] + context)]
  # concatenate and print the results
  cat(left.context, "\t", node, "\t", right.context, "\n", file = output.conc, append = TRUE)
}
What I am not sure how to do however, is use something like an if statement or something to only capture instances of "blood" followed by "of". Do I need another variable in the for loop? What I want it to do basically is for every instance of "blood" that it finds, I want to see if the word that immediately follows it is "of". I want the loop to find all of those instances and tell me how many there are in my vector.
You can create an index using dplyr::lead to match 'of' following 'blood':
library(dplyr)
novel.vector.words <- c("blood", "of", "blood", "red", "blood", "of", "blue", "blood")
which(grepl("blood", novel.vector.words) & grepl("of", lead(novel.vector.words)))
[1] 1 5
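If you'd rather avoid dplyr, the same index can be built in base R by comparing the vector against a copy of itself shifted one position to the left (a small sketch using the sample vector above):

```r
novel.vector.words <- c("blood", "of", "blood", "red", "blood", "of", "blue", "blood")

# Shift by one position: nxt[i] is the word immediately after position i
nxt <- c(novel.vector.words[-1], NA)

# Positions where "blood" is immediately followed by "of"
hits <- which(novel.vector.words == "blood" & nxt == "of")
```

`length(hits)` then gives the count the asker wanted.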
In response to the question in the comments:
This certainly could be done with a loop based approach but there is little point in re-inventing the wheel when there are already packages better designed and optimized to do the heavy lifting in text mining tasks.
Here is an example of how to find how frequently the words 'blood' and 'of' appear within five words of each other in Bram Stoker's Dracula using the tidytext package.
library(tidytext)
library(dplyr)
library(stringr)
## Read Dracula into dataframe and add explicit line numbers
fulltext <- data.frame(text = readLines("https://www.gutenberg.org/ebooks/345.txt.utf-8",
                                        encoding = "UTF-8"),
                       stringsAsFactors = FALSE) %>%
  mutate(line = row_number())
## Pair of words to search for and word distance
word1 <- "blood"
word2 <- "of"
word_distance <- 5
## Create ngrams using skip_ngrams token
blood_of <- fulltext %>%
  unnest_tokens(output = ngram, input = text, token = "skip_ngrams",
                n = 2, k = word_distance - 1) %>%
  filter(str_detect(ngram, paste0("\\b", word1, "\\b")) &
           str_detect(ngram, paste0("\\b", word2, "\\b")))
## Return count
blood_of %>%
  nrow
[1] 54
## Inspect first six line number indices
head(blood_of$line)
[1] 999 1279 1309 2192 3844 4135
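For intuition, here is a rough base-R version of the same window idea on a tiny invented vector. The counts won't match tidytext's exactly, since skip-grams count word pairs rather than match positions:

```r
words <- c("the", "blood", "of", "kings", "red", "blood", "on",
           "the", "floor", "of", "the", "castle")
window <- 5

pos <- which(words == "blood")
# For each "blood", check whether "of" appears within `window` words either side
near_of <- vapply(pos, function(p) {
  ctx <- words[max(1, p - window):min(length(words), p + window)]
  "of" %in% ctx
}, logical(1))
sum(near_of)
```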

Check for multiple words in string match for text search in r

Presently I have code which works for a one-word search; can we search for multiple words and write those matched words to a dataframe? (For clarification, please refer to this post.) This is akrun's solution, which works for one word.
Here is the code:
library(pdftools)
library(tesseract)
library(stringi) # needed for stri_detect_fixed()

All_files <- Sys.glob("*.pdf")
v1 <- numeric(length(All_files))
word <- "school"
df <- data.frame()
Status <- "Present"

for (i in seq_along(All_files)) {
  file_name <- All_files[i]
  cnt <- pdf_info(All_files[i])$pages
  print(cnt)
  for (j in seq_len(cnt)) {
    img_file <- pdftools::pdf_convert(All_files[i], format = 'tiff', pages = j, dpi = 400)
    text <- ocr(img_file)
    ocr_text <- capture.output(cat(text))
    check <- sapply(ocr_text, paste, collapse = "")
    junk <- dir(path = paste0(path, "/tiff"), pattern = "tiff")
    file.remove(junk)
    br <- if (length(which(stri_detect_fixed(tolower(check), tolower(word)))) <= 0)
      "Not Present" else "Present"
    print(br)
    if (br == "Present") {
      v1[i] <- j
      break
    }
  }
  Status <- if (v1[i] == 0) "Not Present" else "Present"
  pages <- if (v1[i] == 0) "-" else
    paste0(tools::file_path_sans_ext(basename(file_name)), "_", v1[i])
  words <- if (v1[i] == 0) "-" else word
  df <- rbind(df, cbind(file_name = basename(file_name),
                        Status, pages = pages, words = words))
}
Here we are searching for only one word i.e school. Can we search for multiple words like school, gym, swimming pool?
Expected O/P:
fileName   Status       Page     Words          TEXT
test.pdf   Present      test_1   gym            I go gym regularly
test.pdf   Present      test_3   school         Here is the next school
test1.pdf  Present      test1_4  swimming pool  In swimming pool
test1.pdf  Present      test1_7  gym            next to Gold gym
test2.pdf  Not Present  -        -
fileName = name of the file.
Status = "Present" if any word is found, else "Not Present".
Page = "_1", "_3", etc. give the page number on which a word was found; on page "test_1" the word "gym" was found, and on page "test_3" the word "school" was found.
Words = which of the words were found; e.g. only "gym" and "school" were found on pages 1 and 3 of test.pdf, and only "swimming pool" and "gym" were found on pages 4 and 7 of test1.pdf.
TEXT = the text in which the word was found.
Any suggestions on this would be helpful.
Thanks
You go through every PDF in your directory with the outer loop, then go through all pages of the PDF and extract the text in the inner loop. You want to check for every document whether at least one page contains school, gym, or swimming pool. The values you want returned are:
a vector the length of the number of PDF documents, containing either "Present" or "Not present";
three vectors of strings, containing information on which word occurs where.
Right?
You can skip a couple of steps in your loop, especially when converting PDFs to TIFFs and reading text from them with ocr:
all_files <- Sys.glob("*.pdf")
strings <- c("school", "gym", "swimming pool")

# Read text from pdfs
texts <- lapply(all_files, function(x){
  img_file <- pdf_convert(x, format = "tiff", dpi = 400)
  return(tolower(ocr(img_file)))
})

# Check each document for the presence of each word in strings
pages <- words <- vector("list", length(texts))
for (d in seq_along(texts)) {
  for (w in seq_along(strings)) {
    intermed <- grep(strings[w], texts[[d]])
    words[[d]] <- c(words[[d]], strings[w][(length(intermed) > 0)])
    pages[[d]] <- unique(c(pages[[d]], intermed))
  }
}
# Organize data so that it suits your wanted output
fileName <- tools::file_path_sans_ext(basename(all_files))
Page <- Map(paste0, fileName, "_", pages, collapse=", ")
Page[!grepl(",", Page)] <- "-"
Page <- t(data.frame(Page))
Words <- sapply(words, paste0, collapse=", ")
Status <- ifelse(sapply(Words, nchar) > 0, "Present", "Not present")
data.frame(row.names=fileName, Status=Status, Page=Page, Words=Words)
# Status Page Words
# pdf1 Present pdf1_1, pdf1_2 gym, swimming pool
# pdf2 Present pdf2_2, pdf2_5, pdf2_8, pdf2_3, pdf2_6 school, gym, swimming pool
It's not as readable as I'd like it to be, probably because small requirements on the output format force minor intermediate steps that make the code seem a bit chaotic. It works well, though.
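The core of the approach, one grepl() pass per search word over the page texts, can be seen in isolation on toy data (a sketch; the strings below are invented stand-ins for real OCR output):

```r
# Toy stand-in for tolower(ocr(...)) results: one string per page
page_text <- c("i go gym regularly", "here is the next school", "nothing here")
strings <- c("school", "gym", "swimming pool")

# Logical matrix: one row per page, one column per search word
hits <- sapply(strings, function(w) grepl(w, page_text, fixed = TRUE))

# Words found on each page, collapsed into a single string
found <- apply(hits, 1, function(r) paste(strings[r], collapse = ", "))
```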

Parsing XML for Ancient Greek Plays with speaker and dialogue

I am currently trying to read Greek plays which are available online as XML files into a data frame with a dialogue and speaker column.
I run the following commands to download the XML and parse the dialogue and speakers.
library(XML)
library(RCurl)
url <- "http://www.perseus.tufts.edu/hopper/dltext?doc=Perseus%3Atext%3A1999.01.0186"
html <- getURL(url, followlocation = TRUE)
doc <- htmlParse(html, asText=TRUE)
plain.text <- xpathSApply(doc, "//p", xmlValue)
speakersc <- xpathSApply(doc, "//speaker", xmlValue)
dialogue <- data.frame(text = plain.text, stringsAsFactors = FALSE)
speakers <- data.frame(text = speakersc, stringsAsFactors = FALSE)
However, I then encounter a problem. The dialogue will yield 300 rows (for 300 distinct lines in the play), but the speaker will yield 297.
The reason for the problem is the structure of the XML, reproduced below, where the <speaker> tag is not repeated for continued dialogue interrupted by a stage direction. Because I have to separate the dialogue with the <p> tag, this yields two dialogue rows but only one speaker row, without duplicating the speaker accordingly.
<speaker>Creon</speaker>
<stage>To the Guard.</stage>
<p>
You can take yourself wherever you please,
<milestone n="445" unit="line" ed="p"/>
free and clear of a heavy charge.
<stage>Exit Guard.</stage>
</p>
</sp>
<sp>
<stage>To Antigone.</stage>
<p>You, however, tell me—not at length, but briefly—did you know that an edict had forbidden this?</p>
</sp>
How can I parse the XML so the data will correctly yield the same number of dialogue rows for the same number of corresponding speaker rows?
For the above example, I would like the resulting data frame to either contain two rows for Creon's dialogue corresponding to the two lines of dialogue prior and after the stage direction, or one row which treats Creon's dialogue as one line ignoring the separation due to the stage direction.
Thank you very much for your help.
Consider using xpath's forward looking following-sibling to search for the next <p> tag when speaker is empty, all while iterating through <sp> which is the parent to <speaker> and <p>:
# ALL SP NODES
sp <- xpathSApply(doc, "//body/descendant::sp", xmlValue)

# ITERATE THROUGH EACH SP BY NODE INDEX TO CREATE LIST OF DFs
dfList <- lapply(seq_along(sp), function(i){
  data.frame(
    speakers = xpathSApply(doc, paste0("concat(//body/descendant::sp[", i, "]/speaker,'')"), xmlValue),
    dialogue = xpathSApply(doc, paste0("concat(//body/descendant::sp[", i, "]/speaker/following-sibling::p[1], ' ',
                                        //body/descendant::sp[position()=", i + 1, " and not(speaker)]/p[1])"), xmlValue)
  )
})

# ROW BIND LIST OF DFs AND SUBSET EMPTY SPEAKER/DIALOGUE
finaldf <- subset(do.call(rbind, dfList), speakers != "" & dialogue != "")
# SPECIFIC ROWS IN OP'S HIGHLIGHT
finaldf[85,]
# speakers
# 85 Creon
#
# dialogue
# 85 You can take yourself wherever you please,free and clear of a heavy
# charge.Exit Guard. You, however, tell me—not at length, but
# briefly—did you know that an edict had forbidden this?
finaldf[86,]
# speakers dialogue
# 87 Antigone I knew it. How could I not? It was public.
Another option is to read the URL and make some updates before parsing the XML: replace milestone tags with a space to avoid mashing words together, remove stage tags, and then fix the sp node without a speaker.
x <- readLines(url)
x <- gsub("<milestone[^>]*>", " ", x) # add space
x <- gsub("<stage>[^>]*stage>", "", x) # no space
x <- paste(x, collapse = "")
x <- gsub("</p></sp><sp><p>", "", x) # fix sp without speaker
Now the XML has the same number of sp and speaker tags.
doc <- xmlParse(x)
summary(doc)
      p      sp speaker    div2 placeName
    299     297     297      51        25 ...
Finally, get the sp nodes and parse speaker and paragraph.
sp <- getNodeSet(doc, "//sp")
s1 <- sapply( sp, xpathSApply, ".//speaker", xmlValue)
# collapse the 1 node with 2 <p>
p1 <- lapply( sp, xpathSApply, ".//p", xmlValue)
p1 <- trimws(sapply(p1, paste, collapse= " "))
speakers <- data.frame(speaker=s1, dialogue = p1)
speaker dialogue
1 Antigone Ismene, my sister, true child of my own mother, do you know any evil o...
2 Ismene To me no word of our friends, Antigone, either bringing joy or bringin...
3 Antigone I knew it well, so I was trying to bring you outside the courtyard gat...
4 Ismene Hear what? It is clear that you are brooding on some dark news.
5 Antigone Why not? Has not Creon destined our brothers, the one to honored buri...
6 Ismene Poor sister, if things have come to this, what would I profit by loose...
7 Antigone Consider whether you will share the toil and the task.
8 Ismene What are you hazarding? What do you intend?
9 Antigone Will you join your hand to mine in order to lift his corpse?
10 Ismene You plan to bury him—when it is forbidden to the city?
...
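The key repair step, assigning an sp node that lacks a speaker to the previous speaker, can also be done after extraction. A minimal sketch on an invented vector, assuming (as in the Creon example above) that a missing speaker means the previous speaker continues:

```r
# Speakers extracted per <sp> node; NA marks an <sp> with no <speaker> child
speakers <- c("Creon", NA, "Antigone", NA, "Ismene")

# Carry the last seen speaker forward over the gaps
for (i in seq_along(speakers)) {
  if (is.na(speakers[i]) && i > 1) speakers[i] <- speakers[i - 1]
}
```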

NLP - identifying and replacing words (synonyms) in R

I have a problem with my R code.
I have a data set (questions) with 4 columns and over 600k observations; one column is named 'V3'.
This column has questions like 'What is the day?'.
I have a second data set (voc) with 2 columns, named 'word' and 'synonyms'. If my first data set (questions) contains a word from the 'synonyms' column of the second data set (voc), then I want to replace it with the word from the 'word' column.
questions = cbind(V3=c("What is the day today?","Tom has brown eyes"))
questions <- data.frame(questions)
V3
1 what is the day today?
2 Tom has brown eyes
voc = cbind(word=c("weather", "a","blue"),synonyms=c("day", "the", "brown"))
voc <- data.frame(voc)
word synonyms
1 weather day
2 a the
3 blue brown
Desired output
V3 V5
1 what is the day today? what is a weather today?
2 Tom has brown eyes Tom has blue eyes
I wrote simple code but it doesn't work.
for (k in 1:nrow(question)) {
  for (i in 1:nrow(voc)) {
    question$V5 <- gsub(do.call(rbind, strsplit(question$V3[k], " "))[which(do.call(rbind, strsplit(question$V3[k], " ")) == voc[i, 2])], voc[i, 1], question$V3)
  }
}
Maybe someone will try to help me? :)
I wrote a second version, but it doesn't work either.
for (i in 1:nrow(questions)) {
  for (j in 1:nrow(voc)) {
    if (grepl(voc[j, k], do.call(rbind, strsplit(questions[i, ], " "))) == TRUE) {
      new <- matrix(gsub(do.call(rbind, strsplit(questions[i, ], " "))[which(do.call(rbind, strsplit(questions[i, ], " ")) == voc[j, 2])], voc[j, 1], questions[i, ]))
      questions[i, ] <- new
    }
  }
  questions <- cbind(questions, c(new))
}
First, it is important that you use the stringsAsFactors = FALSE option, either at the program level or during your data import. This is because R (prior to version 4.0) defaults to converting strings into factors unless you specify otherwise. Factors are useful in modeling, but you want to do analysis of the text itself, so you should make sure that your text is not coerced to factors.
The way I approached this was to write a function that would "explode" each string into a vector, and then uses match to replace the words. The vector gets reassembled into a string again.
I'm not sure how performant this will be given your 600K records. You might look into some of the R packages that handle strings, like stringr or stringi, since they will probably have functions that do some of this. match tends to be okay on speed, but %in% can be a real beast depending on the length of the string and other factors.
# Start with options to make sure strings are represented correctly
# The rest is your original code (mildly tidied to my own standard)
options(stringsAsFactors = FALSE)
questions <- cbind(V3 = c("What is the day today?","Tom has brown eyes"))
questions <- data.frame(questions)
voc <- cbind(word = c("weather","a","blue"),
synonyms = c("day","the","brown"))
voc <- data.frame(voc)
# This function takes:
# - an input string
# - a vector of words to replace
# - a vector of the words to use as replacements
# It returns a list of the original input and the changed version
uFunc_FindAndReplace <- function(input_string, words_to_repl, repl_words) {
  # Start by breaking the input string into a vector
  # Note that we use [[1]] to get the first list element of strsplit output
  # Obviously this relies on breaking sentences by spacing
  orig_words <- strsplit(x = input_string, split = " ")[[1]]
  # If we find at least one of the words to replace in the original words, proceed
  if (sum(orig_words %in% words_to_repl) > 0) {
    # The right side selects the elements of orig_words that match words to be replaced
    # The left side uses match to find the numeric index of those replacements within the words_to_repl vector
    # This numeric vector is used to select the values from repl_words
    # These then replace the values in orig_words
    orig_words[orig_words %in% words_to_repl] <- repl_words[match(x = orig_words, table = words_to_repl, nomatch = 0)]
    # We rebuild the sentence again, and return a list with original and new version
    new_sent <- paste(orig_words, collapse = " ")
    return(list(original = input_string, new = new_sent))
  } else {
    # Otherwise we return the original version since no changes are needed
    return(list(original = input_string, new = input_string))
  }
}
# Using do.call and rbind.data.frame, we can collapse the output of a lapply()
do.call(what = rbind.data.frame,
        args = lapply(X = questions$V3,
                      FUN = uFunc_FindAndReplace,
                      words_to_repl = voc$synonyms,
                      repl_words = voc$word))
                original                      new
1 What is the day today? What is a weather today?
2     Tom has brown eyes        Tom has blue eyes
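For what it's worth, the asker's original gsub loop can be repaired far more simply by looping over the vocabulary only and using \b word boundaries, so whole words are replaced across the entire column at once. A sketch using the sample data (note the replacement order matters if one entry's word is another entry's synonym):

```r
questions <- data.frame(V3 = c("What is the day today?", "Tom has brown eyes"),
                        stringsAsFactors = FALSE)
voc <- data.frame(word = c("weather", "a", "blue"),
                  synonyms = c("day", "the", "brown"),
                  stringsAsFactors = FALSE)

# One gsub per dictionary row; \\b keeps "the" from matching inside other words
questions$V5 <- questions$V3
for (i in seq_len(nrow(voc))) {
  questions$V5 <- gsub(paste0("\\b", voc$synonyms[i], "\\b"),
                       voc$word[i], questions$V5)
}
questions$V5
# [1] "What is a weather today?" "Tom has blue eyes"
```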

Replace strings in text based on dictionary

I am new to R and need suggestions.
I have a dataframe with 1 text field in it. I need to fix the misspelled words in that text field. To help with that, I have a second file (dictionary) with 2 columns - the misspelled words and the correct words to replace them.
How would you recommend doing it? I wrote a simple "for loop" but the performance is an issue.
The file has ~120K rows and the dictionary has ~5k rows and the program's been running for hours. The text can have a max of 2000 characters.
Here is the code:
output <- source_file$MEMO_MANUAL_TXT
for (i in 1:nrow(fix_file)) { # dictionary file
  target <- paste0(" ", fix_file$change_to_target[i], " ")
  replace <- paste0(" ", fix_file$target[i], " ")
  output <- gsub(target, replace, output, fixed = TRUE)
}
I would try agrep. I'm not sure how well it scales, though.
Eg.
> agrep("laysy", c("1 lazy", "1", "1 LAZY"), max = 2, value = TRUE)
[1] "1 lazy"
Also check out pmatch and charmatch although I feel they won't be as useful to you.
Here is an example to show @joran's comment, using a data.table left join. It is very fast (instantaneous here).
library(data.table)

n1 <- 120e3
n2 <- 1e3
set.seed(1)

## create vocab
tt <- outer(letters, letters, paste0)
vocab <- as.vector(outer(tt, tt, paste0))

## create the dictionary
dict <- data.table(miss = sample(vocab, n2, rep = FALSE),
                   good = sample(letters, n2, rep = TRUE),
                   key = 'miss')

## the text table
orig <- data.table(miss = sample(vocab, n1, rep = TRUE), key = 'miss')

orig[dict]
      miss good
   1: aakq    v
   2: adac    t
   3: adxj    r
   4: aeye    t
   5: afji    g
  ---
1027: zvia    d
1028: zygp    p
1029: zyjm    x
1030: zzak    t
1031: zzvs    q
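If the dictionary only contains whole-word misspellings, a named-vector lookup per word avoids running ~5k gsub passes over the whole text. A sketch with an invented toy dictionary (this, like the padded-space gsub in the question, only fixes space-delimited whole words):

```r
# Misspelling dictionary as a named vector: names are the bad forms
fix <- c(laysy = "lazy", teh = "the")

correct_words <- function(txt) {
  words <- strsplit(txt, " ", fixed = TRUE)[[1]]
  bad <- words %in% names(fix)
  words[bad] <- fix[words[bad]]   # vectorized lookup by name
  paste(words, collapse = " ")
}

correct_words("teh dog is laysy")
# [1] "the dog is lazy"
```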
