how to extract plain text from .docx file using R

how to extract plain text from .docx file using R - r

Anyone know of anything they can recommend in order to extract just the plain text from an article with in .docx format (preferable with R) ?
Speed isn't crucial, and we could even use a website that has some API to upload and extract the files but i've been unable to find one. I need to extract the introduction, the method, the result and the conclusion I want to delete the abstract, the references, and specially the graphics and the table
thanks

You can try to use readtext library:
library(readtext)
x <- readtext("/path/to/file/myfile.docx")
# x$text will contain the plain text in the file
Variable x contains just the text without any formatting, so if you need to extract some information you need to perform string search. For example for the document you mentioned in your comment, one approach could be as follows:
library(readtext)
doc.text <- readtext("test.docx")$text
# Split text into parts using new line character:
doc.parts <- strsplit(doc.text, "\n")[[1]]
# First line in the document- the name of the Journal
journal.name <- doc.parts[1]
journal.name
# [1] "International Journal of Science and Research (IJSR)"
# Similarly we can extract some other parts from a header
issn <- doc.parts[2]
issue <- doc.parts[3]
# Search for the Abstract:
abstract.loc <- grep("Abstract:", doc.parts)[1]
# Search for the Keyword
Keywords.loc <- grep("Keywords:", doc.parts)[1]
# The text in between these 2 keywords will be abstract text:
abstract.text <- paste(doc.parts[abstract.loc:(Keywords.loc-1)], collapse=" ")
# Same way we can get Keywords text:
Background.loc <- Keywords.loc + grep("1\\.", doc.parts[-(1:Keywords.loc)])[1]
Keywords.text <- paste(doc.parts[Keywords.loc:(Background.loc-1)], collapse=" ")
Keywords.text
# [1] "Keywords: Nephronophtisis, NPHP1 deletion, NPHP4 mutations, Tunisian patients"
# Assuming that Methods is part 2
Methods.loc <- Background.loc + grep("2\\.", doc.parts[-(1:Background.loc)])[1]
Background.text <- paste(doc.parts[Background.loc:(Methods.loc-1)], collapse=" ")
# Assuming that Results is Part 3
Results.loc <- Methods.loc- + grep("3\\.", doc.parts[-(1:Methods.loc)])[1]
Methods.text <- paste(doc.parts[Methods.loc:(Results.loc-1)], collapse=" ")
# Similarly with other parts. For example for Acknowledgements section:
Ack.loc <- grep("Acknowledgements", doc.parts)[1]
Ref.loc <- grep("References", doc.parts)[1]
Ack.text <- paste(doc.parts[Ack.loc:(Ref.loc-1)], collapse=" ")
Ack.text
# [1] "6. Acknowledgements We are especially grateful to the study participants.
# This study was supported by a grant from the Tunisian Ministry of Health and
# Ministry of Higher Education ...
The exact approach depends on the common structure of all the documents you need to search through. For example if the first section is always named "Background" you can use this word for your search. However if this could sometimes be "Background" and sometimes "Introduction" then you might want to search for "1." pattern.

You should find that one of these packages will do the trick for you.
https://davidgohel.github.io/officer/
https://cran.r-project.org/web/packages/docxtractr/index.html
At the end of the day the modern Office file formats (OpenXML) are simply *.zip files containing structured XML content and so if you have well structured content then you may just want to open it that way. I would start here (http://officeopenxml.com/anatomyofOOXML.php) and you should be able to unpick the OpenXML SDK for guidance as well (https://msdn.microsoft.com/en-us/library/office/bb448854.aspx)

Pandoc is a fantastic solution for tasks like this. With a document named a.docx you would run at the command line
pandoc -f docx -t markdown -o a.md a.docx
You could then use regex tools in R to extract what you needed from the newly-created a.md, which is text. By default, images are not converted.
Pandoc is part of RStudio, by the way, so you may already have it.

You can do it with package officer:
library(officer)
example_pptx <- system.file(package = "officer", "doc_examples/example.docx")
doc <- read_docx(example_pptx)
summary_paragraphs <- docx_summary(doc)
summary_paragraphs[summary_paragraphs$content_type %in% "paragraph", "text"]
#> [1] "Title 1"
#> [2] "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
#> [3] "Title 2"
#> [4] "Quisque tristique "
#> [5] "Augue nisi, et convallis "
#> [6] "Sapien mollis nec. "
#> [7] "Sub title 1"
#> [8] "Quisque tristique "
#> [9] "Augue nisi, et convallis "
#> [10] "Sapien mollis nec. "
#> [11] ""
#> [12] "Phasellus nec nunc vitae nulla interdum volutpat eu ac massa. "
#> [13] "Sub title 2"
#> [14] "Morbi rhoncus sapien sit amet leo eleifend, vel fermentum nisi mattis. "
#> [15] ""
#> [16] ""
#> [17] ""

Related

How to remove punctuation from tokens, when quanteda tokenizes at sentence level?

It is my ultimate goal to select some sentences from a corpus which match a certain pattern & perform a sentiment analysis upon these selected cutouts from the corpus. I am trying to do all of that with a current version of quanteda in R.
I noticed that remove_punctuation does not remove punctuation when tokens is applied at the sentence-level (what = "sentence"). When decomposing the selected sentence-tokens to word-tokens for the sentiment analysis, the word-tokens will contain punctuation such as "," or ".". Dictionaries are then no longer able to match on these tokens. Reproducible example:
mypattern <- c("country", "honor")
#
txt <- c(wash1 <- "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate.",
wash2 <- "When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor.",
blind <- "Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.")
#
toks <- tokens_select(tokens(txt, what = "sentence", remove_punct = TRUE),
pattern = paste0(mypattern, collapse = "|"),
valuetype = "regex",
selection = "keep")
#
toks
For instance, the tokens in toks contain "citizens," or "arrive,". I thought about splitting the tokens back to word-tokens by tokens_split(toks, separator = " ") but separator does allow one input parameter only.
Is there a way to remove the punctuation from the sentences when tokenizing at the sentence-level?

There are better ways to go about your goal, which consists of performing sentiment analysis on just sentences from documents containing your target pattern. You can do this by first reshaping your corpus into sentences, then tokenising them, then using tokens_select() with the window argument to select only those documents containing the pattern. In this case you will set a window so large that it will include the entire sentence.
library("quanteda")
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 67.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c("Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate.
When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor.
Lorem ipsum dolor sit amet.")
corp <- corpus(txt)
corp_sent <- corpus_reshape(corp, to = "sentences")
corp_sent
#> Corpus consisting of 3 documents.
#> text1.1 :
#> "Fellow citizens, I am again called upon by the voice of my c..."
#>
#> text1.2 :
#> "When the occasion proper for it shall arrive, I shall endeav..."
#>
#> text1.3 :
#> "Lorem ipsum dolor sit amet."
# sentiment on just the documents with the pattern
mypattern <- c("country", "honor")
toks <- tokens(corp_sent) %>%
tokens_select(pattern = mypattern, window = 10000000)
toks
#> Tokens consisting of 3 documents.
#> text1.1 :
#> [1] "Fellow" "citizens" "," "I" "am" "again"
#> [7] "called" "upon" "by" "the" "voice" "of"
#> [ ... and 11 more ]
#>
#> text1.2 :
#> [1] "When" "the" "occasion" "proper" "for" "it"
#> [7] "shall" "arrive" "," "I" "shall" "endeavor"
#> [ ... and 12 more ]
#>
#> text1.3 :
#> character(0)
# now perform sentiment analysis on the selected tokens
tokens_lookup(toks, dictionary = data_dictionary_LSD2015) %>%
dfm()
#> Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 0 docvars.
#> features
#> docs negative positive neg_positive neg_negative
#> text1.1 0 0 0 0
#> text1.2 0 5 0 0
#> text1.3 0 0 0 0
Created on 2022-03-22 by the reprex package (v2.0.1)
Note that if you to exclude the sentences that were empty, just use dfm_subset(dfmat, nfeat(dfmat) > 0) where dfmat is your saved output sentiment analysis dfm.

Longest line in text dataset

I am looking for a way to find the length of the longest line in a text file.
E.g. consider a simple dataset from the tm package.
install.packages("tm")
library(tm)
txt <- system.file("texts", "txt", package = "tm")
ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"), readerControl =
list(language = "lat"))
length(ovid)
[1] 5
ovid is composed by five documents each one composed by a character vector of n elements (from 16 to 18), between which I would like to identify the longest.
I found documentation for python, C# and for bash shell but, surprisingly, I did not find anything with R. Because of that, my attempts were quite naive, with:
max(nchar(ovid))
[1] 5410
max(length(ovid))
[1] 5

Actually it's the fourth text which is the longest, once we remove the padding from whitespace. Here's how. Note that a lot of this comes from the difficulty of getting texts out of a tm (V)Corpus object, which has been asked (several times) before, for instance here.
Note that I am interpreting your question about "lines" as referring to the five documents, which are more than five lines each, but consist of multiple lines (between 16 and 18 length character vectors each). I hope I have interpreted this correctly.
texts <- sapply(ovid$content, "[[", "content")
str(texts)
## List of 5
## $ : chr [1:16] " Si quis in hoc artem populo non novit amandi," " hoc legat et lecto carmine doctus amet." " arte citae veloque rates remoque moventur," " arte leves currus: arte regendus amor." ...
## $ : chr [1:17] " quas Hector sensurus erat, poscente magistro" " verberibus iussas praebuit ille manus." " Aeacidae Chiron, ego sum praeceptor Amoris:" " saevus uterque puer, natus uterque dea." ...
## $ : chr [1:17] " vera canam: coeptis, mater Amoris, ades!" " este procul, vittae tenues, insigne pudoris," " quaeque tegis medios, instita longa, pedes." " nos venerem tutam concessaque furta canemus," ...
## $ : chr [1:17] " scit bene venator, cervis ubi retia tendat," " scit bene, qua frendens valle moretur aper;" " aucupibus noti frutices; qui sustinet hamos," " novit quae multo pisce natentur aquae:" ...
## $ : chr [1:18] " mater in Aeneae constitit urbe sui." " seu caperis primis et adhuc crescentibus annis," " ante oculos veniet vera puella tuos:" " sive cupis iuvenem, iuvenes tibi mille placebunt." ...
So here we have extracted the texts, but they are on multiple lines represented by one element each of the character vectors that each "document" comprises, and because they are verses, there is variable white space padding at the beginning and end of some of these elements. Let's trim these and just leave the text, using stringi's stri_trim_both function.
# need to trim leading and trailing whitespace
texts <- lapply(texts, stringi::stri_trim_both)
## texts[1]
## [[1]]
## [1] "Si quis in hoc artem populo non novit amandi," "hoc legat et lecto carmine doctus amet."
## [3] "arte citae veloque rates remoque moventur," "arte leves currus: arte regendus amor."
## [5] "" "curribus Automedon lentisque erat aptus habenis,"
## [7] "Tiphys in Haemonia puppe magister erat:" "me Venus artificem tenero praefecit Amori;"
## [9] "Tiphys et Automedon dicar Amoris ego." "ille quidem ferus est et qui mihi saepe repugnet:"
## [11] "" "sed puer est, aetas mollis et apta regi."
## [13] "Phillyrides puerum cithara perfecit Achillem," "atque animos placida contudit arte feros."
## [15] "qui totiens socios, totiens exterruit hostes," "creditur annosum pertimuisse senem."
# now paste them together to make a single character vector of the five documents
texts <- sapply(texts, paste, collapse = "\n")
str(texts)
## chr [1:5] "Si quis in hoc artem populo non novit amandi,\nhoc legat et lecto carmine doctus amet.\narte citae veloque rates remoque movent"| __truncated__ ...
cat(texts[1])
## Si quis in hoc artem populo non novit amandi,
## hoc legat et lecto carmine doctus amet.
## arte citae veloque rates remoque moventur,
## arte leves currus: arte regendus amor.
##
## curribus Automedon lentisque erat aptus habenis,
## Tiphys in Haemonia puppe magister erat:
## me Venus artificem tenero praefecit Amori;
## Tiphys et Automedon dicar Amoris ego.
## ille quidem ferus est et qui mihi saepe repugnet:
##
## sed puer est, aetas mollis et apta regi.
## Phillyrides puerum cithara perfecit Achillem,
## atque animos placida contudit arte feros.
## qui totiens socios, totiens exterruit hostes,
## creditur annosum pertimuisse senem.
That's looking more like it. Now we can figure out which was longest.
nchar(texts)
## [1] 600 621 644 668 622
which.max(nchar(texts))
## [1] 4

Is there in R something like the "here document" in bash?

My script contains the line
lines <- readLines("~/data")
I would like to keep the content of the file data (verbatim) in the script itself. Is there in R a "read_the_following_lines" function? Something like to the "here document" in the bash shell?

Multi-line strings are going to be as close as you get. It's definitely not the same (since you have to care about the quotes) but it does work pretty well for what you're trying to achieve (and you can do it with more than read.table):
here_lines <- 'line 1
line 2
line 3
'
readLines(textConnection(here_lines))
## [1] "line 1" "line 2" "line 3" ""
here_csv <- 'thing,val
one,1
two,2
'
read.table(text=here_csv, sep=",", header=TRUE, stringsAsFactors=FALSE)
## thing val
## 1 one 1
## 2 two 2
here_json <- '{
"a" : [ 1, 2, 3 ],
"b" : [ 4, 5, 6 ],
"c" : { "d" : { "e" : [7, 8, 9]}}
}
'
jsonlite::fromJSON(here_json)
## $a
## [1] 1 2 3
##
## $b
## [1] 4 5 6
##
## $c
## $c$d
## $c$d$e
## [1] 7 8 9
here_xml <- '<CATALOG>
<PLANT>
<COMMON>Bloodroot</COMMON>
<BOTANICAL>Sanguinaria canadensis</BOTANICAL>
<ZONE>4</ZONE>a
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$2.44</PRICE>
<AVAILABILITY>031599</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>Columbine</COMMON>
<BOTANICAL>Aquilegia canadensis</BOTANICAL>
<ZONE>3</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$9.37</PRICE>
<AVAILABILITY>030699</AVAILABILITY>
</PLANT>
</CATALOG>
'
str(xml <- XML::xmlParse(here_xml))
## Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
print(xml)
## <?xml version="1.0"?>
## <CATALOG>
## <PLANT><COMMON>Bloodroot</COMMON><BOTANICAL>Sanguinaria canadensis</BOTANICAL><ZONE>4</ZONE>a
## <LIGHT>Mostly Shady</LIGHT><PRICE>$2.44</PRICE><AVAILABILITY>031599</AVAILABILITY></PLANT>
## <PLANT>
## <COMMON>Columbine</COMMON>
## <BOTANICAL>Aquilegia canadensis</BOTANICAL>
## <ZONE>3</ZONE>
## <LIGHT>Mostly Shady</LIGHT>
## <PRICE>$9.37</PRICE>
## <AVAILABILITY>030699</AVAILABILITY>
## </PLANT>
## </CATALOG>

Pages 90f. of An introduction to R state that it is possible to write R scripts like this (I quote the example modified from there):
chem <- scan()
2.90 3.10 3.40 3.40 3.70 3.70 2.80 2.50 2.40 2.40 2.70 2.20
5.28 3.37 3.03 3.03 28.95 3.77 3.40 2.20 3.50 3.60 3.70 3.70
print(chem)
Write these lines into a file, and give it the name, say, heredoc.R. If you then execute that script non-interactively by typing in your terminal
Rscript heredoc.R
you will get the following output
Read 24 items
[1] 2.90 3.10 3.40 3.40 3.70 3.70 2.80 2.50 2.40 2.40 2.70 2.20
[13] 5.28 3.37 3.03 3.03 28.95 3.77 3.40 2.20 3.50 3.60 3.70 3.70
So you see that the data provided in the file are saved in the variable chem. The function scan(.) reads from the connection stdin() per default. stdin() refers to user input from the console in interactive mode (a call to R without specified script), but when an input script is read in, the following lines of that script are read *). The empty line after the data is important because it marks the end of the data.
This also works with tabular data:
tab <- read.table(file=stdin(), header=T)
A B C
1 1 0
2 1 0
3 2 9
summary(tab)
When using readLines(.), you must specify the number of lines read; the approach with the empty line does not work here:
txt <- readLines(con=stdin(), n=5)
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ultricies diam
sed felis mattis, id commodo enim hendrerit. Suspendisse iaculis bibendum eros,
ut mattis eros interdum sit amet. Pellentesque condimentum eleifend blandit. Ut
commodo ligula quis varius faucibus. Aliquam accumsan tortor velit, et varius
sapien tristique ut. Sed accumsan, tellus non iaculis luctus, neque nunc
print(txt)
You can overcome this limitation by reading one line at a time until one line is empty or some other predefined string. Note however, that you may run out of memory if you read a large (>100MB) file this way, because each time you append a string to your read-in data, all the data is copied to another place in memory. See the chapter "Growing objects" in The R inferno:
txt <- c()
repeat{
x <- readLines(con=stdin(), n=1)
if(x == "") break # you can use any EOF string you want here
txt = c(txt, x)
}
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ultricies diam
sed felis mattis, id commodo enim hendrerit. Suspendisse iaculis bibendum eros,
ut mattis eros interdum sit amet. Pellentesque condimentum eleifend blandit. Ut
commodo ligula quis varius faucibus. Aliquam accumsan tortor velit, et varius
sapien tristique ut. Sed accumsan, tellus non iaculis luctus, neque nunc
print(txt)
*) If you want to read from standard input in an R script, for example because you want to create a reusable script which can be called with any input data (Rscript reusablescript.R < input.txt or
some-data-generating-command | Rscript reusablescript.R), use not stdin() but file("stdin").

Since R v4.0.0, there is a new syntax for raw strings, as stated in changelogs, that largely allows heredocs style documents to be created.
Additionally, from help(Quotes):
The delimiter pairs [] and {} can also be used, and R can be used in place of r. For additional flexibility, a number of dashes can be placed between the opening quote and the opening delimiter, as long as the same number of dashes appear between the closing delimiter and the closing quote.
As an example, one can use (on a system with BASH shell):
file_raw_string <-
r"(#!/bin/bash
echo $#
for word in $#;
do
echo "This is the word: '${word}'."
done
exit 0
)"
writeLines(file_raw_string, "print_words.sh")
system("bash print_words.sh Word/1 w#rd2 LongWord composite-word")
or even another R script:
file_raw_string <- r"(
x <- lapply(mtcars[,1:4], mean)
cat(
paste(
"Mean for column", names(x), "is", format(x,digits = 2),
collapse = "\n"
)
)
cat("\n")
cat(r"{ - This is a raw string where \n, "", '', /, \ are allowed.}")
)"
writeLines(file_raw_string, "print_means.R")
source("print_means.R")
#> Mean for column mpg is 20
#> Mean for column cyl is 6.2
#> Mean for column disp is 231
#> Mean for column hp is 147
#> - This is a raw string where \n, "", '', /, \ are allowed.
Created on 2021-08-01 by the reprex package (v2.0.0)

A way to do multi-line strings but not worry about quotes (only backticks) you can use:
as.character(quote(`
all of the crazy " ' ) characters, except
backtick and bare backslashes that aren't
printable, e.g. \n works but a \ and c with no space between them would fail`))

What about some more recent tidyverse syntax?
SQL <- c("
SELECT * FROM patient
LEFT OUTER JOIN projectpatient ON patient.patient_id = projectpatient.patient_id
WHERE projectpatient.project_id = 16;
") %>% stringr::str_replace_all("[\r\n]"," ")

How to repeat this statement in R probably using apply()

It might seem a silly question but how to repeat this line for 152 times and I would not like to use a for loop,since later it will not be efficient with larger data sets:
reviews = as.vector(t(mydata)[,1])
mydata is a row in a data.frame and
reviews is an array of characters, also
[,1] is just the first row
The output could be a matrix or worst case a data.frame.
I tried something like this, but it did not work :
testing = apply(mydata, 1, function(x) {as.vector(t(mydata[,x]))})
Error in t(mydata)[, x] : subscript out of bounds
Thanks.
EDIT:
Quick data sample:
> reviews = as.vector(t(mydata)[,1])
> class(reviews)
[1] "character"
> length(reviews)
[1] 14
> reviews
[1] "I was involuntarily"
[2] "I was in transit"
[3] "My initial flight"
[4] "That still left"
[5] "After disembarking"
[6] "customs and proceed to my gate."
[7] "I arrived"
[8] "When my boarding pass was scanned"
[9] "No reason was given for the bump."
[10] "The UA gate staff"
[11] "I boarded Air Canada."
[12] "After arriving"
[13] "I spent 5 hours"
[14] NA
mydata data.frame:
> class(mydata)
[1] "data.frame"
> length(mydata[,1])
[1] 152
> mydata[,1]
[1] I was involuntarily... .
[2] First time... .
...
...
152 Levels: First time . ...
I have about 30.000 of these, but I want to start small, so only 152 of paragraphs split in individual sentence and put into a data.frame. Each row in the data.frame has 5-15 sentences.
I want to to be able to access each row as an array since I need to perform some action on each row of the data.frame
Packages used: plyr, sentiment(downloaded from here and installed manually)
EDIT 2:
dput(myData[1:6, 1:6])
structure(list(V1 = structure(c(70L, 41L, 94L, 114L, 47L, 49L),
.Label = c(" Air Canada",
"their service",
"hours for de-icing",
"have flown BA",
"my booking",
"If the video screen",
"Frankfurt flights",
"and another 150 lines of text data",

Here's a recommended way to ask a question, focusing on the fact that your actual data is too big, too complicated, or too private to share.
Question: how to apply a function on each row of a data.frame?
My data:
# make up some data
s <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
mydata <- as.data.frame(matrix(strsplit(s, '\\s')[[1]][1:18], nrow=3, ncol=6), stringsAsFactors=FALSE)
mydata
## V1 V2 V3 V4 V5 V6
## 1 Lorem sit adipiscing do incididunt et
## 2 ipsum amet, elit, eiusmod ut dolore
## 3 dolor consectetur sed tempor labore magna
If you have data that you can use directly, then as has been suggested multiple times in the comments, the use of dput is helpful:
mydata <- structure(list(V1 = c("Lorem", "ipsum", "dolor"),V2 = c("sit", "amet,", "consectetur"), V3 = c("adipiscing", "elit,", "sed"),
V4 = c("do", "eiusmod", "tempor"), V5 = c("incididunt", "ut", "labore"), V6 = c("et", "dolore", "magna")), .Names = c("V1",
"V2", "V3", "V4", "V5", "V6"), row.names = c(NA, -3L), class = "data.frame")
In either order, state (i) what you are trying to do, and (ii) what you have tried and how it is not working.
My desired output:
Converting a row into a vector is ... confusing. A row is already a vector, so I don't know what you are ultimately trying to do. So, I'll come up with something short an to the point: I want the words on each row to be in reverse alphabetical order, perhaps like this:
## V1 V2 V3 V4 V5 V6
## 1 sit Lorem incididunt et do adipiscing
## 2 ut ipsum elit, eiusmod dolore amet,
## 3 tempor sed magna labore dolor consectetur
This is a good time to show the code you've tried, errors you've encountered, and/or how the unerring output is not what you intended.
Answer, generically:
Several ways to do something to each row:
Use apply, though this breaks if you have numeric and character intermingled. If you try this, you'll see that the output is actually the transpose of what you may think, in which case you'll need to wrap (and all of the other *apply-based suggestions here) with t(...). It's a little confusing, but it's necessary here. Oh, and they'll all be a matrix class which can easily be converted to data.frame if needed.
ret <- apply(mydata, 1, function(r) {
do_something(r)
})
Use sapply or lapply on row indices. Note that these are returning lists or vectors of results, so you'll need to convert into whatever format you ultimately need.
ret <- sapply(1:nrow(mydata), function(i) {
do_something(mydata[i,])
})
# if you need to keep each row's results rather encapsulated, use one of the following:
ret <- sapply(1:nrow(mydata), function(i) {
do_something(mydata[i,])
}, simplify=FALSE)
ret <- lapply(1:nrow(mydata), function(i) {
do_something(mydata[i,])
})
Use foreach and iterators.
library(foreach)
library(iterators)
ret <- foreach(df=iter(mydata, by='row'), .combine=rbind) %do% {
do_something(df) # just one row of mydata this time
}
In the case of my (contrived) question, here are several ways to do it:
as.data.frame(t(apply(mydata, 1, function(r) sort(r, decreasing=TRUE))))
## V1 V2 V3 V4 V5 V6
## 1 sit Lorem incididunt et do adipiscing
## 2 ut ipsum elit, eiusmod dolore amet,
## 3 tempor sed magna labore dolor consectetur
as.data.frame(t(sapply(1:nrow(mydata), function(i) sort(mydata[i,], decreasing=TRUE))))
## same output
library(foreach)
library(iterators)
## notice the use of as.character(...), perhaps still a blasphemy
## to the structure of a data.frame
ret <- foreach(df=iter(mydata, by='row'), .combine=rbind) %do% {
sort(as.character(df), decreasing=TRUE)
}
ret
## [,1] [,2] [,3] [,4] [,5] [,6]
## result.1 "sit" "Lorem" "incididunt" "et" "do" "adipiscing"
## result.2 "ut" "ipsum" "elit," "eiusmod" "dolore" "amet,"
## result.3 "tempor" "sed" "magna" "labore" "dolor" "consectetur"

extract semi-structured text from Word documents

I want to text-mine a set of files based on the below form. I can create a corpus where each file is a document (using tm), but I'm thinking it might be better to create a corpus where each section in the 2nd form table was a document having the following meta data:
Author : John Smith
DateTimeStamp: 2013-04-18 16:53:31
Description :
Heading : Current Focus
ID : Smith-John_e.doc Current Focus
Language : en_CA
Origin : Smith-John_e.doc
Name : John Smith
Title : Manager
TeamMembers : Joe Blow, John Doe
GroupLeader : She who must be obeyed
where Name, Title, TeamMembers and GroupLeader are extracted from the first table on the form. In this way, each chunk of text to be analyzed would maintain some of its context.
What is the best way to approach this? I can think of 2 ways:
somehow parse the corpus I have into child corpora.
somehow parse the document into subdocuments and make a corpus from those.
Any pointers would be much appreciated.
This is the form:
Here is an RData file of a corpus with 2 documents. exc[[1]] came from a .doc and exc[[2]] came from a docx. They both used the form above.

Here's a quick sketch of a method, hopefully it might provoke someone more talented to stop by and suggest something more efficient and robust... Using the RData file in your question, I found that the doc and docx files have slightly different structures and so require slightly different approaches (though I see in the metadata that your docx is 'fake2.txt', so is it really docx? I see in your other Q that you used a converter outside of R, that must be why it's txt).
library(tm)
First get custom metadata for the doc file. I'm no regex expert, as you can see, but it's roughly 'get rid of trailing and leading spaces' then 'get rid of "Word"', then get rid of punctuation...
# create User-defined local meta data pairs
meta(exc[[1]], type = "corpus", tag = "Name1") <- gsub("^\\s+|\\s+$","", gsub("Name", "", gsub("[[:punct:]]", '', exc[[1]][3])))
meta(exc[[1]], type = "corpus", tag = "Title") <- gsub("^\\s+|\\s+$","", gsub("Title", "", gsub("[[:punct:]]", '', exc[[1]][4])))
meta(exc[[1]], type = "corpus", tag = "TeamMembers") <- gsub("^\\s+|\\s+$","", gsub("Team Members", "", gsub("[[:punct:]]", '', exc[[1]][5])))
meta(exc[[1]], type = "corpus", tag = "ManagerName") <- gsub("^\\s+|\\s+$","", gsub("Name of your", "", gsub("[[:punct:]]", '', exc[[1]][7])))
Now have a look at the result
# inspect
meta(exc[[1]], type = "corpus")
Available meta data pairs are:
Author :
DateTimeStamp: 2013-04-22 13:59:28
Description :
Heading :
ID : fake1.doc
Language : en_CA
Origin :
User-defined local meta data pairs are:
$Name1
[1] "John Doe"
$Title
[1] "Manager"
$TeamMembers
[1] "Elise Patton Jeffrey Barnabas"
$ManagerName
[1] "Selma Furtgenstein"
Do the same for the docx file
# create User-defined local meta data pairs
meta(exc[[2]], type = "corpus", tag = "Name2") <- gsub("^\\s+|\\s+$","", gsub("Name", "", gsub("[[:punct:]]", '', exc[[2]][2])))
meta(exc[[2]], type = "corpus", tag = "Title") <- gsub("^\\s+|\\s+$","", gsub("Title", "", gsub("[[:punct:]]", '', exc[[2]][4])))
meta(exc[[2]], type = "corpus", tag = "TeamMembers") <- gsub("^\\s+|\\s+$","", gsub("Team Members", "", gsub("[[:punct:]]", '', exc[[2]][6])))
meta(exc[[2]], type = "corpus", tag = "ManagerName") <- gsub("^\\s+|\\s+$","", gsub("Name of your", "", gsub("[[:punct:]]", '', exc[[2]][8])))
And have a look
# inspect
meta(exc[[2]], type = "corpus")
Available meta data pairs are:
Author :
DateTimeStamp: 2013-04-22 14:06:10
Description :
Heading :
ID : fake2.txt
Language : en
Origin :
User-defined local meta data pairs are:
$Name2
[1] "Joe Blow"
$Title
[1] "Shift Lead"
$TeamMembers
[1] "Melanie Baumgartner Toby Morrison"
$ManagerName
[1] "Selma Furtgenstein"
If you have a large number of documents then a lapply function that includes these meta functions would be the way to go.
Now that we've got the custom metadata, we can subset the documents to exclude that part of the text:
# create new corpus that excludes part of doc that is now in metadata. We just use square bracket indexing to subset the lines that are the second table of the forms (slightly different for each doc type)
excBody <- Corpus(VectorSource(c(paste(exc[[1]][13:length(exc[[1]])], collapse = ","),
paste(exc[[2]][9:length(exc[[2]])], collapse = ","))))
# get rid of all the white spaces
excBody <- tm_map(excBody, stripWhitespace)
Have a look:
inspect(excBody)
A corpus with 2 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
|CURRENT RESEARCH FOCUS |,| |,|Lorem ipsum dolor sit amet, consectetur adipiscing elit. |,|Donec at ipsum est, vel ullamcorper enim. |,|In vel dui massa, eget egestas libero. |,|Phasellus facilisis cursus nisi, gravida convallis velit ornare a. |,|MAIN AREAS OF EXPERTISE |,|Vestibulum aliquet faucibus tortor, sed aliquet purus elementum vel. |,|In sit amet ante non turpis elementum porttitor. |,|TECHNOLOGY PLATFORMS, INSTRUMENTATION EMPLOYED |,| Vestibulum sed turpis id nulla eleifend fermentum. |,|Nunc sit amet elit eu neque tincidunt aliquet eu at risus. |,|Cras tempor ipsum justo, ut blandit lacus. |,|INDUSTRY PARTNERS (WITHIN THE PAST FIVE YEARS) |,| Pellentesque facilisis nisl in libero scelerisque mattis eu quis odio. |,|Etiam a justo vel sapien rhoncus interdum. |,|ANTICIPATED PARTICIPATION IN PROGRAMS, EITHER APPROVED OR UNDER DEVELOPMENT |,|(Please include anticipated percentages of your time.) |,| Proin vitae ligula quis enim vulputate sagittis vitae ut ante. |,|ADDITIONAL ROLES, DISTINCTIONS, ACADEMIC QUALIFICATIONS AND NOTES |,|e.g., First Aid Responder, Other languages spoken, Degrees, Charitable Campaign |,|Canvasser (GCWCC), OSH representative, Social Committee |,|Sed nec tellus nec massa accumsan faucibus non imperdiet nibh. |,,
[[2]]
CURRENT RESEARCH FOCUS,,* Lorem ipsum dolor sit amet, consectetur adipiscing elit.,* Donec at ipsum est, vel ullamcorper enim.,* In vel dui massa, eget egestas libero.,* Phasellus facilisis cursus nisi, gravida convallis velit ornare a.,MAIN AREAS OF EXPERTISE,* Vestibulum aliquet faucibus tortor, sed aliquet purus elementum vel.,* In sit amet ante non turpis elementum porttitor. ,TECHNOLOGY PLATFORMS, INSTRUMENTATION EMPLOYED,* Vestibulum sed turpis id nulla eleifend fermentum.,* Nunc sit amet elit eu neque tincidunt aliquet eu at risus.,* Cras tempor ipsum justo, ut blandit lacus.,INDUSTRY PARTNERS (WITHIN THE PAST FIVE YEARS),* Pellentesque facilisis nisl in libero scelerisque mattis eu quis odio.,* Etiam a justo vel sapien rhoncus interdum.,ANTICIPATED PARTICIPATION IN PROGRAMS, EITHER APPROVED OR UNDER DEVELOPMENT ,(Please include anticipated percentages of your time.),* Proin vitae ligula quis enim vulputate sagittis vitae ut ante.,ADDITIONAL ROLES, DISTINCTIONS, ACADEMIC QUALIFICATIONS AND NOTES,e.g., First Aid Responder, Other languages spoken, Degrees, Charitable Campaign Canvasser (GCWCC), OSH representative, Social Committee,* Sed nec tellus nec massa accumsan faucibus non imperdiet nibh.,,
Now the documents are ready for text mining, with the data from the upper table moved out of the document and into the document metadata.
Of course all of this depends on the documents being highly regular. If there are different numbers of lines in the first table in each doc, then the simple indexing method might fail (give it a try and see what happens) and something more robust will be needed.
UPDATE: A more robust method
Having read the question a little more carefully, and got a bit more education about regex, here's a method that is more robust and doesn't depend on indexing specific lines of the documents. Instead, we use regular expressions to extract text from between two words to make the metadata and split the document
Here's how we make the User-defined local meta data (a method to replace the one above)
library(gdata) # for the trim function
txt <- paste0(as.character(exc[[1]]), collapse = ",")
# inspect the document to identify the words on either side of the string
# we want, so 'Name' and 'Title' are on either side of 'John Doe'
extract <- regmatches(txt, gregexpr("(?<=Name).*?(?=Title)", txt, perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "Name1") <- trim(gsub("[[:punct:]]", "", extract))
extract <- regmatches(txt, gregexpr("(?<=Title).*?(?=Team)", txt, perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "Title") <- trim(gsub("[[:punct:]]","", extract))
extract <- regmatches(txt, gregexpr("(?<=Members).*?(?=Supervised)", txt, perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "TeamMembers") <- trim(gsub("[[:punct:]]","", extract))
extract <- regmatches(txt, gregexpr("(?<=your).*?(?=Supervisor)", txt, perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "ManagerName") <- trim(gsub("[[:punct:]]","", extract))
# inspect
meta(exc[[1]], type = "corpus")
Available meta data pairs are:
Author :
DateTimeStamp: 2013-04-22 13:59:28
Description :
Heading :
ID : fake1.doc
Language : en_CA
Origin :
User-defined local meta data pairs are:
$Name1
[1] "John Doe"
$Title
[1] "Manager"
$TeamMembers
[1] "Elise Patton Jeffrey Barnabas"
$ManagerName
[1] "Selma Furtgenstein"
Similarly we can extract the sections of your second table into separate
vectors and then you can make them into documents and corpora or just work
on them as vectors.
txt <- paste0(as.character(exc[[1]]), collapse = ",")
CURRENT_RESEARCH_FOCUS <- trim(gsub("[[:punct:]]","", regmatches(txt, gregexpr("(?<=CURRENT RESEARCH FOCUS).*?(?=MAIN AREAS OF EXPERTISE)", txt, perl=TRUE))))
[1] "Lorem ipsum dolor sit amet consectetur adipiscing elit Donec at ipsum est vel ullamcorper enim In vel dui massa eget egestas libero Phasellus facilisis cursus nisi gravida convallis velit ornare a"
MAIN_AREAS_OF_EXPERTISE <- trim(gsub("[[:punct:]]","", regmatches(txt, gregexpr("(?<=MAIN AREAS OF EXPERTISE).*?(?=TECHNOLOGY PLATFORMS, INSTRUMENTATION EMPLOYED)", txt, perl=TRUE))))
[1] "Vestibulum aliquet faucibus tortor sed aliquet purus elementum vel In sit amet ante non turpis elementum porttitor"
And so on. I hope that's a bit closer to what you're after. If not, it might be best to break down your task into a set of smaller, more focused questions, and ask them separately (or wait for one of the gurus to stop by this question!).

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

how to extract plain text from .docx file using R - r

Related

How to remove punctuation from tokens, when quanteda tokenizes at sentence level?

Longest line in text dataset

Is there in R something like the "here document" in bash?

How to repeat this statement in R probably using apply()

extract semi-structured text from Word documents

Categories

Resources