I don't know if this is the right place to ask, but could you help me split a text into several sentences using R?
I have a database that contains descriptions of the activities that employees perform. I would like to split each description into sentences and then extract the verb-noun pairs from each sentence.
I can do this line by line, but since there are many lines it would take forever, so I would like to know how to do it for the entire column.
You can see the database at: https://docs.google.com/spreadsheets/d/1NiMj37q8_hJhuNFCiQcjO6UBvI9_-OM4/edit?usp=sharing&ouid=115543599430411372875&rtpof=true&sd=true
I can do it one row at a time, as in the following code, but I would like to do it for the entire description column.
library(udpipe)

docs <- "Determine and formulate policies and provide overall direction of companies or private and public sector organizations within guidelines set up by a board of directors or similar governing body. Plan, direct, or coordinate operational activities at the highest level of management with the help of subordinate executives and staff managers."
docs <- setNames(docs, "doc1")

# Annotate: tokenization, sentence splitting, tagging, dependency parsing
anno <- udpipe(docs, object = "english", udpipe_model_repo = "bnosac/udpipe.models.ud")
anno <- cbind_dependencies(anno, type = "parent")

# Keep tokens where both the token and its parent are a noun or a verb
subset(anno, upos_parent %in% c("NOUN", "VERB") & upos %in% c("NOUN", "VERB"),
       select = c("doc_id", "paragraph_id", "sentence_id", "token", "token_parent",
                  "dep_rel", "upos", "upos_parent"))
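udpipe() is vectorized over documents, so the same pipeline can be run over the whole column at once instead of row by row. A minimal sketch, assuming the spreadsheet has been read into a data frame df with the activity text in a column named Description (both names are hypothetical placeholders):

```r
library(udpipe)

# Hypothetical input: df$Description holds one activity description per row
docs <- setNames(df$Description, paste0("doc", seq_len(nrow(df))))

anno <- udpipe(docs, object = "english")
anno <- cbind_dependencies(anno, type = "parent")

# One row per noun/verb dependency pair, per sentence, per document
pairs <- subset(anno,
                upos_parent %in% c("NOUN", "VERB") & upos %in% c("NOUN", "VERB"),
                select = c("doc_id", "sentence_id", "token", "token_parent", "dep_rel"))
```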
I would like to create several PDF files in R Markdown.
This is a sample of my data:
mydata <- data.frame(First = c("John", "Hui", "Jared","Jenner"), Second = c("Smith", "Chang", "Jzu","King"), Sport = c("Football","Ballet","Ballet","Football"), Age = c("12", "13", "12","13"), Submission = c("Microbes may be the friends of future colonists living off the land on the moon, Mars or elsewhere in the solar system and aiming to establish self-sufficient homes.
Space colonists, like people on Earth, will need what are known as rare earth elements, which are critical to modern technologies. These 17 elements, with daunting names like yttrium, lanthanum, neodymium and gadolinium, are sparsely distributed in the Earth’s crust. Without the rare earths, we wouldn’t have certain lasers, metallic alloys and powerful magnets that are used in cellphones and electric cars.", "But mining them on Earth today is an arduous process. It requires crushing tons of ore and then extracting smidgens of these metals using chemicals that leave behind rivers of toxic waste water.
Experiments conducted aboard the International Space Station show that a potentially cleaner, more efficient method could work on other worlds: let bacteria do the messy work of separating rare earth elements from rock.", "“The idea is the biology is essentially catalyzing a reaction that would occur very slowly without the biology,” said Charles S. Cockell, a professor of astrobiology at the University of Edinburgh.
On Earth, such biomining techniques are already used to produce 10 to 20 percent of the world’s copper and also at some gold mines; scientists have identified microbes that help leach rare earth elements out of rocks.", "Blank"))
With help from the community, I was able to arrive at a nice R Markdown solution that creates a single HTML file with all the data I want.
This is saved as Essay to Word.Rmd:
```{r echo = FALSE}
# using data from above
# mydata <- mydata
# Define template (using column names from data.frame)
template <- "**First:** `r First` **Second:** `r Second` <br>
**Age:** `r Age`
**Submission** <br>
`r Submission`"
# Now process the template for each row of the data.frame
src <- lapply(1:nrow(mydata), function(i) {
  knitr::knit_child(text = template, envir = mydata[i, ], quiet = TRUE)
})
```
# Print result to document
`r knitr::knit_child(text=unlist(src))`
This creates a single file containing all of the students' submissions.
I would like to create a single HTML (or preferably PDF) file for each "sport" listed in the data, so that I would have all the submissions of the students who do Ballet in one file, and a separate file with all the submissions of the students who play Football.
I have been looking at a few different solutions, and I found this one the most helpful:
R Knitr PDF: Is there a possibility to automatically save PDF reports (generated from .Rmd) through a loop?
Following suit, I created a separate R script to loop through and subset the data by sport:
library(rmarkdown)
for (sport in unique(mydata$Sport)){
  subgroup <- mydata[mydata$Sport == sport,]
  render("Essay to Word.Rmd", output_file = paste0('report.', sport, '.html'))
}
Unfortunately, this creates a separate file for each sport, but every file contains ALL the students, not just those who belong to that sport.
Any idea what might be going on with this code above?
Is it possible to directly create these files as PDF documents instead of HTML? I know I can click on each file to save it as PDF after the fact, but I will have 40 different sport files to work with.
Is it possible to add a thin line between each "submission" essay within a file?
Any help would be great, thank you!!!
This could be achieved via a parametrized report like so:
Add parameters for the data and, e.g., the type of sport to your Rmd.
Inside the loop, pass your subgroup dataset to render() via the params argument.
You can add horizontal lines via ***.
If you want PDF output, use output_format = "pdf_document". Additionally, to get the document to render I had to switch the LaTeX engine via output_options.
Rmd:
---
params:
  data: null
  sport: null
---
```{r echo = FALSE}
# the subgroup passed in via render(..., params = ...)
data <- params$data
# Define template (using column names from data.frame)
template <- "
***
**First:** `r First` **Second:** `r Second` <br>
**Age:** `r Age`
**Submission** <br>
`r Submission`"
# Now process the template for each row of the data.frame
src <- lapply(1:nrow(data), function(i) {
  knitr::knit_child(text = template, envir = data[i, ], quiet = TRUE)
})
```
# Print result to document. Sport: `r params$sport`
`r knitr::knit_child(text=unlist(src))`
R Script:
mydata <- data.frame(First = c("John", "Hui", "Jared", "Jenner"),
                     Second = c("Smith", "Chang", "Jzu", "King"),
                     Sport = c("Football", "Ballet", "Ballet", "Football"),
                     Age = c("12", "13", "12", "13"),
                     Submission = c("Microbes may be the friends of future colonists living off the land on the moon, Mars or elsewhere in the solar system and aiming to establish self-sufficient homes.
Space colonists, like people on Earth, will need what are known as rare earth elements, which are critical to modern technologies. These 17 elements, with daunting names like yttrium, lanthanum, neodymium and gadolinium, are sparsely distributed in the Earth’s crust. Without the rare earths, we wouldn’t have certain lasers, metallic alloys and powerful magnets that are used in cellphones and electric cars.", "But mining them on Earth today is an arduous process. It requires crushing tons of ore and then extracting smidgens of these metals using chemicals that leave behind rivers of toxic waste water.
Experiments conducted aboard the International Space Station show that a potentially cleaner, more efficient method could work on other worlds: let bacteria do the messy work of separating rare earth elements from rock.", "“The idea is the biology is essentially catalyzing a reaction that would occur very slowly without the biology,” said Charles S. Cockell, a professor of astrobiology at the University of Edinburgh.
On Earth, such biomining techniques are already used to produce 10 to 20 percent of the world’s copper and also at some gold mines; scientists have identified microbes that help leach rare earth elements out of rocks.", "Blank"))
for (sport in unique(mydata$Sport)) {
  subgroup <- mydata[mydata$Sport == sport, ]
  rmarkdown::render("test.Rmd", output_format = "html_document",
                    output_file = paste0('report.', sport, '.html'),
                    params = list(data = subgroup, sport = sport))
  rmarkdown::render("test.Rmd", output_format = "pdf_document",
                    output_options = list(latex_engine = "xelatex"),
                    output_file = paste0('report.', sport, '.pdf'),
                    params = list(data = subgroup, sport = sport))
}
In order to directly create a PDF from your Rmd file, you could use the following function in a separate R script where your data is loaded, and then use map() from the purrr package to iterate over the data (in the Rmd file the output must be set to pdf_document):
library(tidyverse)
library(rmarkdown)

get_report <- function(sport){
  # `sport` arrives as a plain string, so no tidy evaluation is needed;
  # .env$sport disambiguates the argument from the data columns.
  # render() evaluates the Rmd in this function's environment by default,
  # so the document sees the filtered `mydata`.
  mydata <- mydata %>%
    filter(Sport == .env$sport)
  render("test.rmd", output_file = paste0('report_', sport, '.pdf'))
}

map(unique(mydata$Sport), get_report)
Hope that is what you are looking for?
I am extracting the quarterly income statement with the tabulizer package and converting it to tabular form.
library(tabulizer)

# 2017 Q3 report
telia_url <- "http://www.teliacompany.com/globalassets/telia-company/documents/reports/2017/q3/telia-company-q3-2017-en"
telialists <- extract_tables(telia_url)
teliatest1 <- as.data.frame(telialists[[22]])

# 2009 Q3 report
telia_url2009 <- "http://www.teliacompany.com/globalassets/telia-company/documents/reports/2009/q3/teliasonera-q3-2009-report-en.pdf"
telialists2009 <- extract_tables(telia_url2009)
teliatest2 <- as.data.frame(telialists2009[[9]])
I am interested only in the Condensed Consolidated Statements of Comprehensive Income table. This heading is exact or very similar across all the historical reports.
Above, for the 2017 report, list element 22 was the correct table. However, since the 2009 report had a different layout, element 9 was the correct one for that particular report.
What would be a clever solution to make this function dynamic, depending on where the string (or substring) of "Condensed Consolidated Statements of Comprehensive Income" is located?
Perhaps using the tm package to find the relative position?
Thanks
You could use pdftools to find the page you're interested in.
For instance a function like this one should do the job:
get_table <- function(url) {
  txt <- pdftools::pdf_text(url)
  p <- grep("condensed consolidated statements.{0,10}comprehensive income",
            txt,
            ignore.case = TRUE)[1]
  L <- tabulizer::extract_tables(url, pages = p)
  i <- which.max(lengths(L))
  data.frame(L[[i]])
}
The first step is to read all the pages in the character vector txt. Then grep allows you to find the first page looking like the one you want (I inserted .{0,10} to allow a maximum of ten characters like spaces or newlines in the middle of the title).
Using tabulizer, you can extract the list L of all tables located on this page, which should be much faster than extracting all the tables of the document, as you did. Your table is probably the biggest on that page, hence the which.max.
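For example, with the two report URLs from the question (the list index no longer matters, since the page and the table are located automatically):

```r
# telia_url and telia_url2009 as defined in the question above
income_2017 <- get_table(telia_url)
income_2009 <- get_table(telia_url2009)
```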
I'm using this "RISmed" library to do some query of my gene or protein of interest and the output comes with pubmed ID basically, but most of the times it consist of non-specific hits as well which are not my interest. As I can only see the pubmed ID I have to manually put those returned ID and search them in NCBI to see if the paper is of my interest or not.
Question: Is there a way to to return the abstract of the paper or summary sort of along with its pumed ID , which can be implemented in R?
If anyone can help it would be really great..
Using the example from the manual, we need the EUtilsGet function.
library(RISmed)

search_topic <- 'copd'
search_query <- EUtilsSummary(search_topic, retmax = 10,
                              mindate = 2012, maxdate = 2012)
summary(search_query)

# see the ids of our returned query
QueryId(search_query)

# get actual data from PubMed
records <- EUtilsGet(search_query)
class(records)

# store it
pubmed_data <- data.frame('Title' = ArticleTitle(records),
                          'Abstract' = AbstractText(records))
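Since pubmed_data is an ordinary data frame, you can also attach the IDs and screen out non-specific hits without leaving R. A small sketch: RISmed provides a PMID() accessor for Medline objects, and "smoking" below is just an illustrative filter term:

```r
# Add the PubMed ID alongside each title/abstract
pubmed_data$PMID <- PMID(records)

# Keep only records whose abstract mentions a term of interest
relevant <- subset(pubmed_data, grepl("smoking", Abstract, ignore.case = TRUE))
relevant[, c("PMID", "Title")]
```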
I am interested in a large amount of free data from FRED, OECD, BIS, the World Bank, etc. All of the data are macroeconomic in nature; I am not really interested in stock price data.
I would essentially like to construct a CSV table with all the symbols available to me in quantmod and Quandl. I am almost certain this table would be useful for others as well.
Symbol, Title, Units, Frequency
X X X X
Y Y Y Y
I found a similar question which has no answer.
How can i see all available data series from quantmod package?
Is there a way to do this instead of searching FRED, OECD, and Quandl manually, country by country and variable by variable?
Thank you.
Doe, on Quandl you can query the API for a dataset search, as described in this link:
https://www.quandl.com/docs/api#dataset-search
If you're using the Quandl package, you could use the Quandl.search function to build your query as specified in the above link. Here's the query from the link above using Quandl.search:
Quandl.search(query = "crude oil", page = 1, source = "DOE", silent = TRUE)
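From there you could collect the search results into the Symbol/Title/Frequency table you describe and write it to CSV. A rough sketch; the field names used below (code, name, frequency) are assumptions based on the API documentation, so inspect str(results[[1]]) to confirm what your version of the package actually returns:

```r
library(Quandl)

results <- Quandl.search(query = "crude oil", page = 1, source = "DOE", silent = TRUE)

# Assumed fields per result: $code (symbol), $name (title), $frequency
symbols <- do.call(rbind, lapply(results, function(d) {
  data.frame(Symbol = d$code,
             Title = d$name,
             Frequency = d$frequency,
             stringsAsFactors = FALSE)
}))

write.csv(symbols, "quandl_symbols.csv", row.names = FALSE)
```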