I have a set of short text files that I was able to combine into one dataset so that each file occupies one row.
I am trying to summarize the content with the LSAfun package, using its generic function genericSummary(text, k, split = c(".", "!", "?"), min = 5, breakdown = FALSE, ...).
This works very well for a single text entry; however, it does not work in my case. The package documentation says the text input should be "A character vector of length(text) = 1 specifying the text to be summarized".
Please see this example:
# Generate a dataset example (text examples were copied from wikipedia):
dd = structure(list(text = structure(1:2, .Label = c("Forest gardening, a forest-based food production system, is the world's oldest form of gardening.[1] Forest gardens originated in prehistoric times along jungle-clad river banks and in the wet foothills of monsoon regions. In the gradual process of families improving their immediate environment, useful tree and vine species were identified, protected and improved while undesirable species were eliminated. Eventually foreign species were also selected and incorporated into the gardens.[2]\n\nAfter the emergence of the first civilizations, wealthy individuals began to create gardens for aesthetic purposes. Ancient Egyptian tomb paintings from the New Kingdom (around 1500 BC) provide some of the earliest physical evidence of ornamental horticulture and landscape design; they depict lotus ponds surrounded by symmetrical rows of acacias and palms. A notable example of ancient ornamental gardens were the Hanging Gardens of Babylon—one of the Seven Wonders of the Ancient World —while ancient Rome had dozens of gardens.\n\nWealthy ancient Egyptians used gardens for providing shade. Egyptians associated trees and gardens with gods, believing that their deities were pleased by gardens. Gardens in ancient Egypt were often surrounded by walls with trees planted in rows. Among the most popular species planted were date palms, sycamores, fir trees, nut trees, and willows. These gardens were a sign of higher socioeconomic status. In addition, wealthy ancient Egyptians grew vineyards, as wine was a sign of the higher social classes. Roses, poppies, daisies and irises could all also be found in the gardens of the Egyptians.\n\nAssyria was also renowned for its beautiful gardens. These tended to be wide and large, some of them used for hunting game—rather like a game reserve today—and others as leisure gardens. Cypresses and palms were some of the most frequently planted types of trees.\n\nGardens were also available in Kush. 
In Musawwarat es-Sufra, the Great Enclosure dated to the 3rd century BC included splendid gardens. [3]\n\nAncient Roman gardens were laid out with hedges and vines and contained a wide variety of flowers—acanthus, cornflowers, crocus, cyclamen, hyacinth, iris, ivy, lavender, lilies, myrtle, narcissus, poppy, rosemary and violets[4]—as well as statues and sculptures. Flower beds were popular in the courtyards of rich Romans.",
"The Middle Ages represent a period of decline in gardens for aesthetic purposes. After the fall of Rome, gardening was done for the purpose of growing medicinal herbs and/or decorating church altars. Monasteries carried on a tradition of garden design and intense horticultural techniques during the medieval period in Europe. Generally, monastic garden types consisted of kitchen gardens, infirmary gardens, cemetery orchards, cloister garths and vineyards. Individual monasteries might also have had a \"green court\", a plot of grass and trees where horses could graze, as well as a cellarer's garden or private gardens for obedientiaries, monks who held specific posts within the monastery.\n\nIslamic gardens were built after the model of Persian gardens and they were usually enclosed by walls and divided in four by watercourses. Commonly, the centre of the garden would have a reflecting pool or pavilion. Specific to the Islamic gardens are the mosaics and glazed tiles used to decorate the rills and fountains that were built in these gardens.\n\nBy the late 13th century, rich Europeans began to grow gardens for leisure and for medicinal herbs and vegetables.[4] They surrounded the gardens by walls to protect them from animals and to provide seclusion. During the next two centuries, Europeans started planting lawns and raising flowerbeds and trellises of roses. Fruit trees were common in these gardens and also in some, there were turf seats. At the same time, the gardens in the monasteries were a place to grow flowers and medicinal herbs but they were also a space where the monks could enjoy nature and relax.\n\nThe gardens in the 16th and 17th century were symmetric, proportioned and balanced with a more classical appearance. Most of these gardens were built around a central axis and they were divided into different parts by hedges. 
Commonly, gardens had flowerbeds laid out in squares and separated by gravel paths.\n\nGardens in Renaissance were adorned with sculptures, topiary and fountains. In the 17th century, knot gardens became popular along with the hedge mazes. By this time, Europeans started planting new flowers such as tulips, marigolds and sunflowers."
), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
# This code is trying to generate the summary into another column:
dd$sum = genericSummary(dd$text,k=1)
This gives an error:
Error in strsplit(text, split = split, fixed = T) : non-character argument
I believe this is because I am passing a whole column rather than a single text.
My expected output is the generated summary for each row, stored in a corresponding second column dd$sum.
I tried using as.vector(dd$text), but this does not work either (I feel it still combines the output into one row).
I tried to read a bit about the map function from purrr but was not able to apply it in this case, and I was wondering if someone with experience in R programming could help.
Also, if you know a way to do this using text-summary packages, e.g. lexRankr, that would work too. I tried the code from Text summarization in R language but it is still not working.
Thank you
Check class(dd$text). It is a factor, not a character vector.
The following works:
library(dplyr)
library(purrr)
dd %>%
  mutate(text = as.character(text)) %>%
  mutate(sum = map(text, genericSummary, k = 1))
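If you prefer base R over purrr, an equivalent sketch (assuming LSAfun is loaded and that genericSummary(..., k = 1) returns exactly one string per text) is:

```r
# Coerce the factor column to character, then summarize row by row.
# vapply() enforces that each call yields a single character string.
dd$text <- as.character(dd$text)
dd$sum  <- vapply(dd$text, genericSummary, character(1),
                  k = 1, USE.NAMES = FALSE)
```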
Related
I have large text documents (150K lines per document; 160 documents). I read them in as a large VCorpus and convert it to a data frame, which runs quite quickly. Now I want to separate each sentence into its own row and remove those that do not contain certain keywords. When I run that code, R crashes. If I try it with a single document, the code runs for approximately 10 minutes.
The data (text) looks like this (just with much more text):
| Company1 | Company1 hat engaged in child safety in South Africa. The Community rewards this with great recognition. Charity is a big part of the companies culture.
| Company2 | Company2 opened up several factories in the that do not pay the minimum wage. Affordable Housing is not one of thair priorities.
There is also a small example structure below.
library(readr)
library(qdap)
library(tm)
corp <- VCorpus(DirSource("test"))
text <- as.data.frame(corp)
names(text)[1] <- 'doc_id'
names(text)[2] <- 'text'
text <- dput(text)
text <- structure(list(doc_id = c("Company1.txt", "Company2.txt", "Company3.txt"
), text = c("Acucap's market cap on listing in 2002 was a meagre R372m, which would make it one of the smallest stocks in the real estate sector in today's terms. This may be a reflection of how the real estate sector of the JSE has evolved over the past 10 years. But, it is also amazing that Acucap - at a current market cap of more than R7bn - has managed to retain much of the same level of entrepreneurial spirit as at the time of its listing. Still at the helm is sector veteran Paul Theodosiou. The team has grown since listing. The core operational supporting team is Jonathan Rens, Gavin Jones and Craig Kotze, and the finance team is led by founding chief financial officer Baden Marlow. Acucap's physical portfolio is concentrated in retail, which comprises 74%, offices make up 24% and industrial takes the remaining 2%. Some of the larger retail properties include Festival Mall (the largest single asset in the portfolio) in Kempton Park, Bayside Mall in Tableview and Keywest shopping centre in Krugersdorp. The office portfolio includes assets like Mowbray Golf Park in Pinelands in the Cape, the Microsoft offices in Bryanston and 82 Grayston Drive in the Sandton CBD. Though small in Acucap's portfolio, the industrial portfolio is of a high quality and offers good scope for growth as expansion continues at the N1 Business Park in Midrand and the Montague Business Park in Cape Town. Acucap's other investments include a 17,2% shareholding in Sycom Property Fund (a fellow listed real estate property company) as well as a 100% shareholding in Sycom's management company. Sycom's other significant shareholder is Hyprop Investments. The latter's 33,9% stake in Sycom has been the subject of much speculation over the past few years as the ownership structure has created somewhat of an impasse. Acucap has, however, stated its strategic intent to retain Sycom as a separate fund. Company1 hat engaged in child safety in South Africa. 
The Community rewards this with great recognition. Charity is a big part of the companies culture.",
"Australian copper producer Aditya Birla Minerals Ltd. said June 16 that there is a risk that the overall reserves at its Nifty copper mine in Western Australia may be adversely affected due to the effects of an earlier ground collapse at the mine. The company also said it will be unable to provide its annual reserve update until all investigative activities are finalized. Aditya Birla is expecting site costs for the June quarter to be in the range of A$17 million to A$19 million, higher than the earlier estimate of A$12 million to A$15 million due to higher-than-expected mine activity, unplanned maintenance on critical infrastructure and additional employee costs. The company was forced to halt operations at the Nifty mine in March following suspected underground subsidence, which led to the standing down of 350 mine employees until at least July 15. Phase two of probe drilling, which is being undertaken as part of efforts to assess mine safety, is progressing ahead of schedule and is now expected to be completed by the end of June instead of mid-July. Aditya Birla said preliminary observations from this second phase are similar to those from phase one drilling, in that there is less water being intersected between levels 16 and 20 than previously expected. Areas with potential to self-propagate into new sinkholes are being reviewed and, as Aditya Birla expected, the rock mass strength has deteriorated on top of mined-out areas. Management has begun an initial review of the results to identify new gaps and/or the need for additional confirmatory drilling. The first phase of the seismic system installation has been commissioned and is fully functional, albeit limited in coverage to upper parts of the mine. The pit has been dewatered with residual mud left at the pit sump. Aditya Birla added that while surface cracking has continued to develop in the area affected by the sinkhole, this was expected and does not present a hazard. 
However, washouts will occur after heavy rainfall, leading to widening and deepening of the cracks.",
"In preparation for the potential completion of the merger with AGL Resources Inc., the board of directors of Nicor Inc. announced a special pro rata dividend. The dividend is contingent upon the merger being completed prior to Nicor's next scheduled dividend record date, Dec. 31, to ensure that \"shareholders continue to receive a dividend at the current rate until the closing of the merger,\" the company said. In a Nov. 1 news release, Nicor said its board of directors declared a pro rata dividend of 0.5 cent per share per day from Oct. 1 until and including the day before the merger effective date. The dividend is the daily equivalent of the current quarterly dividend rate of 46.5 cents per share. It will be paid to Nicor shareholders of record at the close of business on the day immediately before the effective merger date, which is expected in the fourth quarter. \"The dividend will be paid as soon as practical following the completion of the merger,\" Nicor said. \"This pro rata dividend is in addition to the previously announced Nov. 1 dividend that was paid to shareholders of record Sept. 30. Following the merger closing, AGL Resources is expected to pay a dividend at the rate of $0.004945055 per share, per day for the remainder of the current quarterly dividend period. These dividend payment scenarios, together with a similar plan announced today by AGL Resources, will synchronize the companies' dividends as of the merger effective date in accordance with the merger agreement.\" If the merger is not completed by Dec. 31, Nicor shareholders of record Dec. 31 will receive the regular quarterly dividend of 46.5 cents per share, payable Feb. 1, 2012, and a new pro rata dividend will be announced to ensure that shareholders receive a dividend at the current rate until the merger is completed. Company2 opened up several factories in the that do not pay the minimum wage. Affordable Housing is not one of thair priorities."
)), class = "data.frame", row.names = c(NA, 3L))
options(stringsAsFactors = FALSE)
Sys.setlocale('LC_ALL','C')
library(dplyr)
library(tidyr)
sckeywords <- c("Affordable Housing", "Benefit The Masses",
                "Charitability", "Charitable", "Charitably", " Charities ", " Charity ")
pat <- paste0(sckeywords, collapse = '|')
text2 <- text %>%
  separate_rows(text, sep = '\\.\\s*') %>%
  slice({
    tmp <- grep(pat, text, ignore.case = TRUE)
    sort(unique(c(tmp - 1, tmp, tmp + 1)))
  })
Can I somehow make it run faster without needing more hardware capacity?
I have 16 GB of RAM and a 4-core CPU (i5-10210U).
There are several ways to increase the speed:
I would use data.table instead of dplyr for this amount of data
library(data.table)
sckeywords <- c("Affordable Housing", "Benefit The Masses",
"Charitability", "Charitable", "Charitably", " Charities ", " Charity ")
pat <- paste0(sckeywords, collapse = '|')
setDT(text)
text <- text[, .(text = unlist(tstrsplit(text, "\\.\\s*", type.convert = TRUE))),
             by = "doc_id"]
text <- text[grepl(pat, text)]
You can try to speed up the regex. One possibility is to create a trie, e.g. with trieregex. In your case, it would be:
trie_1 <- "(?:Charitab(?:ility|l[ey])|\\ Charit(?:ies\\ |y\\ )|Benefit\\ The\\ Masses|Affordable\\ Housing)"
text <- text[grepl(trie_1, text)]
# or without the white space around the last two words
trie_2 <- "(?:Charit(?:ab(?:ility|l[ey])|ies|y)|Benefit\\ The\\ Masses|Affordable\\ Housing)"
text <- text[grepl(trie_2, text)]
However, in the short example, using fixed patterns was actually the fastest (checked with microbenchmark), so you could try this:
text <- text[grepl(sckeywords[1], text, fixed = TRUE) | grepl(sckeywords[2], text, fixed = TRUE) |
             grepl(sckeywords[3], text, fixed = TRUE) | grepl(sckeywords[4], text, fixed = TRUE) |
             grepl(sckeywords[5], text, fixed = TRUE) | grepl(sckeywords[6], text, fixed = TRUE) |
             grepl(sckeywords[7], text, fixed = TRUE)]
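The seven explicit grepl() calls can also be collapsed with Reduce(), so the filter works for any number of keywords (same behaviour as the chain above):

```r
# OR-combine one fixed-pattern grepl() per keyword.
hits <- Reduce(`|`, lapply(sckeywords,
                           function(k) grepl(k, text$text, fixed = TRUE)))
text <- text[hits]
```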
If this is still too slow, try to use other tools like awk or sed.
Use parallel processing. You could try to parallelise everything:
make a list of the files to read in and send parts of this file list to the different cores/workers; the workers then read the files themselves, so you don't need to send big files to the cores (you can also skip this step at first and start with your original code if the files are not too big)
build a clean data.table and filter it on each worker
return the data and join everything
When you run into memory issues, you could also save the cleaned data.tables separately from within the workers instead of returning them.
For a parallel framework, I'd recommend foreach or the futureverse
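A minimal sketch of that file-per-worker idea with foreach/doParallel, assuming the files live in the "test" directory from the question and that pat is the keyword regex defined above; the cleaning step is reduced to a simple grepl() filter:

```r
# Hypothetical parallel sketch: each worker reads and filters its own
# files, so only small, filtered tables travel back to the main session.
library(data.table)
library(foreach)
library(doParallel)

cl <- makeCluster(4)                      # one worker per core
registerDoParallel(cl)

files <- list.files("test", full.names = TRUE)

result <- foreach(f = files, .combine = rbind,
                  .packages = "data.table") %dopar% {
  dt <- data.table(doc_id = basename(f),
                   text   = readLines(f, warn = FALSE))
  dt[grepl(pat, text)]                    # 'pat' as defined earlier
}

stopCluster(cl)
```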
Edit
Here is a regex that ignores the case of the first character. However, I'm not an expert on regexes, so maybe there are better ways to optimise the regex!
trie_2 <- "(?:[Cc]harit(?:ab(?:ility|l[ey])|ies|y)|[Bb]enefit\\ [Tt]he\\ [Mm]asses|[Aa]ffordable\\ [Hh]ousing)"
I am trying to extract all the text from the word "Conclusion" to the end of the text, using stringr.
#example data:
cc= structure(list(text = structure(1:2, .Label = c("Forest gardening, a forest-based food production system, is the world's oldest form of gardening.[1] Forest gardens originated in prehistoric times along jungle-clad river banks and in the wet foothills of monsoon regions. In the gradual process of families improving their immediate environment, useful tree and vine species were identified, protected and improved while undesirable species were eliminated. Eventually foreign species were also selected and incorporated into the gardens.[2]\n\nAfter the emergence of the first civilizations, wealthy individuals began to create gardens for aesthetic purposes. Ancient Egyptian tomb paintings from the New Kingdom (around 1500 BC) provide some of the earliest physical evidence of ornamental horticulture and landscape design; they depict lotus ponds surrounded by symmetrical rows of acacias and palms. A notable example of ancient ornamental gardens were the Hanging Gardens of Babylon—one of the Seven Wonders of the Ancient World —while ancient Rome had dozens of gardens.\n\nWealthy ancient Egyptians used gardens for providing shade. Egyptians associated trees and gardens with gods, believing that their deities were pleased by gardens. Gardens in ancient Egypt were often surrounded by walls with trees planted in rows. Among the most popular species planted were date palms, sycamores, fir trees, nut trees, and willows. These gardens were a sign of higher socioeconomic status. In addition, wealthy ancient Egyptians grew vineyards, as wine was a sign of the higher social classes. Roses, poppies, daisies and irises could all also be found in the gardens of the Egyptians.\n\nAssyria was also renowned for its beautiful gardens. These tended to be wide and large, some of them used for hunting game—rather like a game reserve today—and others as leisure gardens. Cypresses and palms were some of the most frequently planted types of trees.\n\nGardens were also available in Kush. 
In Musawwarat es-Sufra, the Great Enclosure dated to the 3rd century BC included splendid gardens. [3]\n\n Conclusion: Ancient Roman gardens were laid out with hedges and vines and contained a wide variety of flowers—acanthus, cornflowers, crocus, cyclamen, hyacinth, iris, ivy, lavender, lilies, myrtle, narcissus, poppy, rosemary and violets[4]—as well as statues and sculptures. Flower beds were popular in the courtyards of rich Romans.",
"The Middle Ages represent a period of decline in gardens for aesthetic purposes. After the fall of Rome, gardening was done for the purpose of growing medicinal herbs and/or decorating church altars. Monasteries carried on a tradition of garden design and intense horticultural techniques during the medieval period in Europe. Generally, monastic garden types consisted of kitchen gardens, infirmary gardens, cemetery orchards, cloister garths and vineyards. Individual monasteries might also have had a \"green court\", a plot of grass and trees where horses could graze, as well as a cellarer's garden or private gardens for obedientiaries, monks who held specific posts within the monastery.\n\nIslamic gardens were built after the model of Persian gardens and they were usually enclosed by walls and divided in four by watercourses. Commonly, the centre of the garden would have a reflecting pool or pavilion. Specific to the Islamic gardens are the mosaics and glazed tiles used to decorate the rills and fountains that were built in these gardens.\n\nBy the late 13th century, rich Europeans began to grow gardens for leisure and for medicinal herbs and vegetables.[4] They surrounded the gardens by walls to protect them from animals and to provide seclusion. During the next two centuries, Europeans started planting lawns and raising flowerbeds and trellises of roses. Fruit trees were common in these gardens and also in some, there were turf seats. At the same time, the gardens in the monasteries were a place to grow flowers and medicinal herbs but they were also a space where the monks could enjoy nature and relax.\n\nThe gardens in the 16th and 17th century were symmetric, proportioned and balanced with a more classical appearance. Most of these gardens were built around a central axis and they were divided into different parts by hedges. 
Commonly, gardens had flowerbeds laid out in squares and separated by gravel paths.\n\nGardens in Renaissance were adorned with sculptures, topiary and fountains. Conclusion \n In the 17th century, knot gardens became popular along with the hedge mazes. By this time, Europeans started planting new flowers such as tulips, marigolds and sunflowers."
), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
I am new to regex, but I did some searching and used this; however, it is not working:
pattern = "(?<=Conclusion)\\S+(?:\\s+\\S+)$"
stringr::str_extract(cc$text, pattern)
The expected output is:
Ancient Roman gardens were laid out with hedges and vines and contained a wide variety of flowers—acanthus, cornflowers, crocus, cyclamen, hyacinth, iris, ivy, lavender, lilies, myrtle, narcissus, poppy, rosemary and violets[4]—as well as statues and sculptures. Flower beds were popular in the courtyards of rich Romans.
In the 17th century, knot gardens became popular along with the hedge mazes. By this time, Europeans started planting new flowers such as tulips, marigolds and sunflowers.
Any guidance?
Thanks a lot in advance.
We can use a pattern that matches 'Conclusion' followed by ':' and a space, or 'Conclusion' followed by a space and a newline, and then matches all the characters after that (.*):
library(stringr)
str_extract(cc$text, "(?<=Conclusion: ).*|(?<=Conclusion \n ).*")
Output:
#[1] "Ancient Roman gardens were laid out with hedges and vines and contained a wide variety of flowers—acanthus, cornflowers, crocus, cyclamen, hyacinth, iris, ivy, lavender, lilies, myrtle, narcissus, poppy, rosemary and violets[4]—as well as statues and sculptures. Flower beds were popular in the courtyards of rich Romans."
#[2] "In the 17th century, knot gardens became popular along with the hedge mazes. By this time, Europeans started planting new flowers such as tulips, marigolds and sunflowers."
Or use trimws after extracting:
trimws(str_extract(cc$text, "(?<=Conclusion)\\s*.*"),
       whitespace = "\\s*[:\n]\\s*")
Alternatively, you can remove everything up to and including the 'Conclusion' marker. In base R this can be done as:
sub('.*Conclusion\\s*[:\n]\\s*', '', cc$text)
#[1] "Ancient Roman gardens were laid out with hedges and vines and contained a wide variety of flowers—acanthus, cornflowers, crocus, cyclamen, hyacinth, iris, ivy, lavender, lilies, myrtle, narcissus, poppy, rosemary and violets[4]—as well as statues and sculptures. Flower beds were popular in the courtyards of rich Romans."
#[2] "In the 17th century, knot gardens became popular along with the hedge mazes. By this time, Europeans started planting new flowers such as tulips, marigolds and sunflowers."
I am working on classification of some documents, and a number of the documents have large sections of similar (and usually irrelevant) text. I would like to identify and remove those similar sections, as I believe I could then build a better model.
An example would be proposals by an organization, each of which contains the same paragraph regarding the organization's mission statement and purpose.
A couple of points make this difficult:
the similar sections are not known ahead of time, making a fixed pattern inappropriate
they could be located anywhere in the documents, and the documents do not have a consistent structure
the pattern could be many characters long, e.g. 3000+ characters
I don't want to remove every similar word, just large sections
I don't want to identify which strings are similar; rather, I want to remove the similar sections
I've considered regex and looked through some packages like stringr, stringdist, and the base functions, but these utilities seem useful only if you already know the pattern and the pattern is much shorter, or if the documents share a similar structure. In my case the text can be structured differently, and the pattern is not predefined but rather whatever is similar between the documents.
I considered making and comparing lists of 3000-grams for each document but this didn't seem feasible or easy to implement.
Below is an example of what a complete solution would look like; but really, I am not even sure how to approach this problem, so information in that direction would be useful as well.
Example code
doc_a <- "this document discusses african hares in the northern sahara. african hares
are the most common land dwelling mammal in the northern sahara. crocodiles eat
african hares. this text is from a book written for the foundation for education
in northern africa."
doc_b <- "this document discusses the nile. The nile delta is in egypt. the nile is the
longest river in the world. the nile has lots of crocodiles. crocodiles and
alligators are different. crocodiles eat african hares. crocodiles are the most common
land dwelling reptile in egypt. this text is from a book written for the foundation
for education in northern africa."
# this function would trim similar sections of 6 or more words in length
# (length in characters is also acceptable)
trim_similar(doc_a, doc_b, 6)
Output
[1] "this document discusses african hares in the northern sahara. african hares
mammal in the northern sahara. crocodiles eat african hares."
[2] "this document discusses the nile. The nile delta is in egypt. the nile is the
longest river in the world. the nile has lots of crocodiles. crocodiles and alligators
are different. crocodiles eat african hares. crocodiles reptile in egypt."
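One way to approach this without a predefined pattern is to tokenize both documents into words, flag every word that falls inside a run of n or more words shared by the other document, and drop the flagged words. Below is a minimal base-R sketch of that idea; `trim_shared()` and `shared_run_flags()` are hypothetical helpers (not the `trim_similar()` from the question), and the greedy matching here ignores punctuation and case normalization:

```r
# Flag every word in words_a that lies inside a run of n consecutive
# words also appearing (as a run) in words_b.
shared_run_flags <- function(words_a, words_b, n) {
  grams_b <- if (length(words_b) >= n) {
    sapply(seq_len(length(words_b) - n + 1),
           function(i) paste(words_b[i:(i + n - 1)], collapse = " "))
  } else character(0)
  flags <- rep(FALSE, length(words_a))
  if (length(words_a) >= n) {
    for (i in seq_len(length(words_a) - n + 1)) {
      gram <- paste(words_a[i:(i + n - 1)], collapse = " ")
      if (gram %in% grams_b) flags[i:(i + n - 1)] <- TRUE
    }
  }
  flags
}

# Remove shared runs of n+ words from both documents.
trim_shared <- function(a, b, n) {
  wa <- strsplit(tolower(a), "\\s+")[[1]]
  wb <- strsplit(tolower(b), "\\s+")[[1]]
  c(paste(wa[!shared_run_flags(wa, wb, n)], collapse = " "),
    paste(wb[!shared_run_flags(wb, wa, n)], collapse = " "))
}

trim_shared("the quick brown fox jumps over the lazy dog today",
            "a quick brown fox jumps over the lazy dog yesterday", 6)
# [1] "the today"   "a yesterday"
```

For real documents you would want to normalize punctuation before tokenizing and perhaps require the shared run to appear verbatim; a suffix-array or rolling-hash approach would scale better than this quadratic scan for 3000+ character sections.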
I have a one-page story (i.e. text data) and I need to apply a Bayesian network to that story and analyse it. Could someone tell me whether this is possible in R? If yes, how should I proceed?
The objective of the analysis is to extract action descriptions from the narrative text.
The data considered for analysis:
Krishna’s Dharam-shasthra to Arjuna:
The Gita is the conversation between Krishna and Arjuna leading up to the battle.
Krishna emphasised on two terms: Karma and Dharma. He told Arjun that this was a righteous war; a war of Dharma. Dharma is the way of righteousness or a set of rules and laws laid down. The Kauravas were on the side of Adharma and had broken rules and laws and hence Arjun would have to do his Karma to uphold Dharma.
Arjuna doesn't want to fight. He doesn't understand why he has to shed his family's blood for a kingdom that he doesn't even necessarily want. In his eyes, killing his evil and killing his family is the greatest sin of all. He casts down his weapons and tells Krishna he will not fight. Krishna, then, begins the systematic process of explaining why it is Arjuna's dharmic duty to fight and how he must fight in order to restore his karma.
Krishna first explains the samsaric cycle of birth and death. He says there is no true death of the soul simply a sloughing of the body at the end of each round of birth and death. The purpose of this cycle is to allow a person to work off their karma, accumulated through lifetimes of action. If a person completes action selflessly, in service to God, then they can work off their karma, eventually leading to a dissolution of the soul, the achievement of enlightenment and vijnana, and an end to the samsaric cycle. If they act selfishly, then they keep accumulating debt, putting them further and further into karmic debt.
What I want is a POS tagger to separate the verbs, nouns, etc., and then to create a meaningful network from them.
The pre-processing steps that should be followed are:

- syntactic processing (POS tagging)
- an SRL algorithm (semantic role labelling of the story's characters)
- coreference resolution

Using all of the above I need to create a knowledge database and build a Bayesian network.
This is what I have tried so far to get the POS tags:
txt <- c("As the years went by, they remained isolated in their city. Their numbers increased by freeing women from slavery.
Doom would come to the world in the form of Ares the god of war and the Son of Zeus. Ares was unhappy with the gods as he wanted to prove just how foul his father’s creation was. Hence, he decided to corrupt the mortal men created by Zeus. Fearing his wrath upon the world Zeus decided to create the God killer in order to stop Ares. He then commanded Hippolyta to mould a baby from the sand and clay of the island. Then the five goddesses went back into the Underworld, drawing out the last soul that remained in the Well and giving it incredible powers. The soul was merged with the clay and became flesh. Hippolyta had her daughter and named her Diana, Princess of the Amazons, the first child born on Paradise Island.
Each of the six members of the Greek Pantheon granted Diana a gift: Demeter, great strength; Athena, wisdom and courage; Artemis, a hunter's heart and a communion with animals; Aphrodite, beauty and a loving heart; Hestia, sisterhood with fire; Hermes, speed and the power of flight. Diana was also gifted with a sword, the Lasso of truth and the bracelets of penance as weapons to defeat Ares.
The time arrived when Diana, protector of the Amazons and mankind was sent to the Man's World to defeat Ares and rid the mortal men off his corruption. Diana believed that only love could truly rid the world of his influence. Diana was successfully able to complete the task she was sent out by defeating Ares and saving the world.
")
writeLines(txt, tf <- tempfile())

library(stringi)
library(cleanNLP)

# tokenizers backend (tokens only, no POS tags)
cnlp_init_tokenizers()
anno <- cnlp_annotate(tf)
names(anno)
cnlp_get_token(anno)

# spaCy backend (adds universal POS tags)
cnlp_init_spacy()
anno <- cnlp_annotate(tf)
cnlp_get_token(anno)

# CoreNLP backend
cnlp_init_corenlp()
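Whichever backend eventually runs, the annotation step should yield a token table with a universal POS column, from which the verbs and nouns can be pulled out for the network. A minimal sketch of that filtering step, using a mock data frame in place of the real cleanNLP output (the column names `token` and `upos` are assumptions about that output):

```r
# Mock stand-in for a cleanNLP token table: one row per token, with
# universal POS tags in `upos`.
tokens <- data.frame(
  token = c("Krishna", "explains", "the", "samsaric", "cycle",
            "Arjuna", "casts", "down", "his", "weapons"),
  upos  = c("PROPN", "VERB", "DET", "ADJ", "NOUN",
            "PROPN", "VERB", "ADP", "PRON", "NOUN"),
  stringsAsFactors = FALSE
)

# Separate the actions (verbs) from the entities (nouns/proper nouns).
verbs <- tokens$token[tokens$upos == "VERB"]
nouns <- tokens$token[tokens$upos %in% c("NOUN", "PROPN")]

verbs  # "explains" "casts"
nouns  # "Krishna" "cycle" "Arjuna" "weapons"
```

Entity-action pairs extracted this way could then populate the knowledge base from which a Bayesian network is learned (e.g. with a structure-learning package such as bnlearn).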
Does anyone know how to replicate the (pg_trgm) Postgres trigram similarity score from the similarity(text, text) function in R? I am using the stringdist package and would rather use R to calculate these scores on a matrix of text strings in a .csv file than run a bunch of PostgreSQL queries.
Running similarity(string1, string2) in Postgres gives me a score between 0 and 1.
I tried using the stringdist package to get a score, but I think I still need to divide the result below by something:
stringdist(string1, string2, method = "qgram", q = 3)
Is there a way to replicate the pg_trgm score with the stringdist package or another way to do this in R?
An example would be getting the similarity score between the description of a book and the description of a genre like science fiction. Say I have two book descriptions and one genre description:
book 1 = "Area X has been cut off from the rest of the continent for decades. Nature has reclaimed the last vestiges of human civilization. The first expedition returned with reports of a pristine, Edenic landscape; the second expedition ended in mass suicide, the third expedition in a hail of gunfire as its members turned on one another. The members of the eleventh expedition returned as shadows of their former selves, and within weeks, all had died of cancer. In Annihilation, the first volume of Jeff VanderMeer's Southern Reach trilogy, we join the twelfth expedition.
The group is made up of four women: an anthropologist; a surveyor; a psychologist, the de facto leader; and our narrator, a biologist. Their mission is to map the terrain, record all observations of their surroundings and of one anotioner, and, above all, avoid being contaminated by Area X itself.
They arrive expecting the unexpected, and Area X delivers—they discover a massive topographic anomaly and life forms that surpass understanding—but it’s the surprises that came across the border with them and the secrets the expedition members are keeping from one another that change everything."
book 2 = "From Wall Street to Main Street, John Brooks, longtime contributor to the New Yorker, brings to life in vivid fashion twelve classic and timeless tales of corporate and financial life in America
What do the $350 million Ford Motor Company disaster known as the Edsel, the fast and incredible rise of Xerox, and the unbelievable scandals at GE and Texas Gulf Sulphur have in common? Each is an example of how an iconic company was defined by a particular moment of fame or notoriety; these notable and fascinating accounts are as relevant today to understanding the intricacies of corporate life as they were when the events happened.
Stories about Wall Street are infused with drama and adventure and reveal the machinations and volatile nature of the world of finance. John Brooks’s insightful reportage is so full of personality and critical detail that whether he is looking at the astounding market crash of 1962, the collapse of a well-known brokerage firm, or the bold attempt by American bankers to save the British pound, one gets the sense that history repeats itself.
Five additional stories on equally fascinating subjects round out this wonderful collection that will both entertain and inform readers . . . Business Adventures is truly financial journalism at its liveliest and best."
genre 1 = "Science fiction is a genre of fiction dealing with imaginative content such as futuristic settings, futuristic science and technology, space travel, time travel, faster than light travel, parallel universes, and extraterrestrial life. It often explores the potential consequences of scientific and other innovations, and has been called a "literature of ideas".[1] Authors commonly use science fiction as a framework to explore politics, identity, desire, morality, social structure, and other literary themes."
How can I get a similarity score, like pg_trgm's, for each book description against the science fiction genre description using an R script?
How about something like this?
library(textcat)
?textcat_xdist
# Compute cross-distances between collections of n-gram profiles.
round(textcat_xdist(
  list(
    text1 = "hello there",
    text2 = "why hello there",
    text3 = "totally different"
  ),
  method = "cosine"
), 3)
#       text1 text2 text3
# text1 0.000 0.078 0.731
# text2 0.078 0.000 0.739
# text3 0.731 0.739 0.000
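Alternatively, pg_trgm's score can be approximated directly in base R. pg_trgm lowercases the text, splits it on non-alphanumeric characters, pads each word with two leading spaces and one trailing space, extracts the distinct trigrams, and scores shared trigrams over total distinct trigrams. A sketch of that scheme (the padding details are my reading of the pg_trgm docs, so scores may differ slightly from similarity() in SQL):

```r
# Distinct 3-grams of a string, pg_trgm-style: lowercase, split on
# non-alphanumerics, pad each word with "  " before and " " after.
trigrams <- function(x) {
  words <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
  words <- words[nzchar(words)]
  unique(unlist(lapply(paste0("  ", words, " "), function(w) {
    substring(w, seq_len(nchar(w) - 2), seq(3, nchar(w)))
  })))
}

# Jaccard similarity of the two trigram sets, as pg_trgm computes it:
# shared trigrams / total distinct trigrams.
trgm_similarity <- function(a, b) {
  ta <- trigrams(a)
  tb <- trigrams(b)
  length(intersect(ta, tb)) / length(union(ta, tb))
}

trgm_similarity("word", "two words")
# ~0.364 (4 shared trigrams out of 11 distinct)
```

With stringdist, `1 - stringdist(a, b, method = "jaccard", q = 3)` gives a related Jaccard score on raw character 3-grams, but without pg_trgm's padding and word splitting it will not match similarity() exactly.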