I have a data frame with several information about patients. I created a loop with R to process each information and write them to a docx file using ReporteRs, but with this loop I obtain as much docx as subjects I have, instead I would like to have 1 unique docx with all information one after the other.
this is the df
Surname Name Born Subject Place
Halls Ben 09/08/2019 3387502 S.Jeorge
Beck David 12/08/2019 1319735 S.Jeorge
Essimy Daniel 12/08/2019 3387789 S.Jeorge
Rich Maria 12/08/2019 3307988 S.Agatha
and this is the code I have written
dfY2 <- read.table("file.txt",header=T)
for(i in 1:nrow(dfY2)) {
my_title <- pot('Exam', textProperties(font.weight = "bold",font.size=12, font.family="Times New Roman"))
row1<-pot("Surname and Name",textProperties(font.weight="bold"))+" "+pot(dfY2[i,1])+" "+pot(dfY2[i,2])+" "+pot("Born",textProperties(font.weight="bold"))+pot(dfY2[i,3])
row2<-pot("SubjectID",textProperties(font.weight="bold"))+" "+pot(dfY2[i,4])+pot("Place",textProperties(font.weight="bold"))+" "+pot(dfY2[i,5])
doc<-docx("Template.docx")%>%
addParagraph(my_title, par.properties=parProperties( text.align = "center"))%>%
addParagraph(c(""))%>%
addParagraph(row1)%>%
addParagraph(row2)%>%
writeDoc(doc,file = paste0(dfY2[i,1],"output.docx"))
}
So, in this way, I obtain several outputs, while I would like to write all the rows one after the other for each subject in only a single doc.
What can I do?
thanks
First of all, I would recommend using the newer package officer from the same author because ReporteRs is not anymore maintained.
To your question: you need to create the 'docx' object before the loop and save it after the loop (eventually you want to add the title before the loop as well):
doc <- docx("Template.docx")
for(i in 1:nrow(dfY2)) {
...
doc <- doc %>%
addParagraph(my_title, par.properties=parProperties( text.align = "center")) %>%
addParagraph(c("")) %>%
addParagraph(row1) %>%
addParagraph(row2)
}
writeDoc(doc, file ="output.docx")
Related
I want to import multiple pdf-files into R but per page there are 4 columns, a header/footer line and a table of contents.
For purpose of text mining I want to remove them from my file or character vector.
Right now I am using two functions to read in the files. The first one is pdf_text because it keeps the pages but can't deal with the 4 columns. The second one is extract_text, this one on its own doesn't keep the pages but can deal with the column structure (and is decently with occuring tables) .
But neither one of them is able to remove the table of contents (as far as I have tried).
My data set is not exactly minimal but otherwise I had some problems with the data structures. Here a working code:
################ relevant code ##############
library(pdftools)
library(tidyverse)
library(tabulizer)
files_name <- "Nachhaltigkeit 2021.pdf"
file_url <- c("https://www.allianz.com/content/dam/onemarketing/azcom/Allianz_com/sustainability/documents/Allianz_Group_Sustainability_Report_2021-web.pdf", "https://www.allianz.com/content/dam/onemarketing/azcom/Allianz_com/investor-relations/en/results-reports/annual-report/ar-2021/en-Allianz-Group-Annual-Report-2021.pdf")
reports_list <- lapply(file_url, pdf_text)
createTibble <- function(){
tibble_together <- NULL
#for all files
for(i in 1:length(files_name)){
page_nr <- length(reports_list[[i]])
tib <- tibble(report = rep(files_name[i], page_nr), page = 1:page_nr, text = gsub("\r\n", " ",
extract_text(files_name[[i]], pages = 1:page_nr)))
tibble_together <- rbind(tibble_together, tib)
}
return(tibble_together)
}
reports_df <- createTibble()
############ code for problem visualization ###############
reports_df <- reports_df %>% unnest_tokens(output = word, input = text, token = "words")
#e.g this part contains the table of contents which is not intended
(reports_df %>% filter(page == 34, report == "Nachhaltigkeit 2021.pdf"))$word[832:885]
Thanks for your help in advance
PS: it's my first question so if you need sth. let me know.
And I know that the function createTibble probably isn't optimal. But that's not my primary concern.
I have a data.frame (dim: 100 x 1) containing a list of url links, each url looks something like this: https:blah-blah-blah.com/item/123/index.do .
The list (the list is a data.frame called my_list with 100 rows and a single column named col and is in character format $ col: chr) together looks like this :
1 "https:blah-blah-blah.com/item/123/index.do"
2" https:blah-blah-blah.com/item/124/index.do"
3 "https:blah-blah-blah.com/item/125/index.do"
etc.
I am trying to import each of these url's into R and collectively save the object as an object that is compatible for text mining procedures.
I know how to successfully convert each of these url's (that are on the list) manually:
library(pdftools)
library(tidytext)
library(textrank)
library(dplyr)
library(tm)
#1st document
url <- "https:blah-blah-blah.com/item/123/index.do"
article <- pdf_text(url)
Once this "article" file has been successfully created, I can inspect it:
str(article)
chr [1:13]
It looks like this:
[1] "abc ....."
[2] "def ..."
etc etc
[15] "ghi ...:
From here, I can successfully save this as an RDS file:
saveRDS(article, file = "article_1.rds")
Is there a way to do this for all 100 articles at the same time? Maybe with a loop?
Something like :
for (i in 1:100) {
url_i <- my_list[i,1]
article_i <- pdf_text(url_i)
saveRDS(article_i, file = "article_i.rds")
}
If this was written correctly, it would save each article as an RDS file (e.g. article_1.rds, article_2.rds, ... article_100.rds).
Would it then be possible to save all these articles into a single rds file?
Please note that list is not a good name for an object, as this will
temporarily overwrite the list() function. I think it is usually good
to name your variables according to their content. Maybe url_df would be
a good name.
library(pdftools)
#> Using poppler version 20.09.0
library(tidyverse)
url_df <-
data.frame(
url = c(
"https://www.nimh.nih.gov/health/publications/autism-spectrum-disorder/19-mh-8084-autismspecdisordr_152236.pdf",
"https://www.nimh.nih.gov/health/publications/my-mental-health-do-i-need-help/20-mh-8134-mymentalhealth-508_161032.pdf"
)
)
Since the urls are already in a data.frame we could store the text data in
an aditional column. That way the data will be easily available for later
steps.
text_df <-
url_df %>%
mutate(text = map(url, pdf_text))
Instead of saving each text in a separate file we can now store all of the data
in a single file:
saveRDS(text_df, "text_df.rds")
For historical reasons for loops are not very popular in the R community.
base R has the *apply() function family that provides a functional
approach to iteration. The tidyverse has the purrr package and the map*()
functions that improve upon the *apply() functions.
I recommend taking a look at
https://purrr.tidyverse.org/ to learn more.
It seems that there are certain url's in your data which are not valid pdf files. You can wrap it in tryCatch to handle the errors. If your dataframe is called df with url column in it, you can do :
library(pdftools)
lapply(seq_along(df$url), function(x) {
tryCatch({
saveRDS(pdf_text(df$url[x]), file = sprintf('article_%d.rds', x)),
},error = function(e) {})
})
So say you have a data.frame called my_df with a column that contains your URLs of pdf locations. As by your comments, it seems that some URLs lead to broken PDFs. You can use tryCatch in these cases to report back which links were broken and check manually what's wrong with these links.
You can do this in a for loop like this:
my_df <- data.frame(url = c(
"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf", # working pdf
"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pfd" # broken pdf
))
# make some useful new columns
my_df$id <- seq_along(my_df$url)
my_df$status <- NA
for (i in my_df$id) {
my_df$status[i] <- tryCatch({
message("downloading ", i) # put a status message on screen
article_i <- suppressMessages(pdftools::pdf_text(my_df$url[i]))
saveRDS(article_i, file = paste0("article_", i, ".rds"))
"OK"
}, error = function(e) {return("FAILED")}) # return the string FAILED if something goes wrong
}
my_df$status
#> [1] "OK" "FAILED"
I included a broken link in the example data on purpose to showcase how this would look.
Alternatively, you can use a loop from the apply family. The difference is that instead of iterating through a vector and applying the same code until the end of the vector, *apply takes a function, applies it to each element of a list (or objects which can be transformed to lists) and returns the results from each iteration in one go. Many people find *apply functions confusing at first because usually people define and apply functions in one line. Let's make the function more explicit:
s_download_pdf <- function(link, id) {
tryCatch({
message("downloading ", id) # put a status message on screen
article_i <- suppressMessages(pdftools::pdf_text(link))
saveRDS(article_i, file = paste0("article_", id, ".rds"))
"OK"
}, error = function(e) {return("FAILED")})
}
Now that we have this function, let's use it to download all files. I'm using mapply which iterates through two vectors at once, in this case the id and url columns:
my_df$status <- mapply(s_download_pdf, link = my_df$url, id = my_df$id)
my_df$status
#> [1] "OK" "FAILED"
I don't think it makes much of a difference which approach you choose as the speed will be bottlenecked by your internet connection instead of R. Just thought you might appreciate the comparison.
The title doesn't really do my question justice, because there are probably a few ways to skin this cat. But I picked one approach and went with it. This is what I'm working with:
I've pulled all the metadata for a particular study in the NCBI database using the "Send to:" option on their interface and downloading a .txt file.
In total, I have ~23k samples, each with up to 609 unique questions and answers from a questionnaire totaling 8M+ obs of 1 variable when read as a .csv. To my dismay, the metadata are irregular. Some samples have 140 associated key/value pairs. Others have 492. I've included a header of a sample below.
1: qiita_sid_10317:10317.BLANK1.6H.GUELPH
Identifiers: BioSample: SAMEA4790059; SRA: ERS2609990
Organism: metagenome
Attributes:
/Alias="qiita_sid_10317:10317.BLANK1.6H.GUELPH"
/description="American Gut control"
/ENA checklist="ERC000011"
/INSDC center alias="UCSDMI"
/INSDC center name="University of California San Diego Microbiome Initiative"
/INSDC first public="2018-07-13T17:03:10Z"
/INSDC last update="2018-07-13T14:50:03Z"
/INSDC status="public"
/SRA accession="ERS2609990"
I've tried (including but not limited to):
Read .txt file (adding a delimiter hasn't made a difference, am I missing something here?)
I've tried reading the data using various delimiters
I've even removed the header data in Sublime Text, leaving only "Attributes:" and the "/"-delimited key/value pairs in order to mess with the column that way
I've split the column found all unique values in col1 to maybe create a df from scratch, etc etc.
Can't seem to get past the cleaning steps:
samples <- read.csv("~/biosample_result_full.txt")
samples_split <- cSplit(samples, splitCols = sample$Colname, sep = "=")
samples_split$Attributes_1 <- gsub(" ", "_", samples_split$Attributes_1)
questions <- unique(samples_split$Attributes_1)
Ideally, each sample and associated metadata would be transformed into rows, with each "Attribute"/question as the column name.
Any help is greatly appreciated.
I see that the website you've linked to, allows fot the option to export data to xml. I strongly suggest to do so. R can hande/parse xml-files very efficient.
When I download the first three results from that site to a file biosample_result.xml , it's easy to process using the xml2-package
library( xml2 )
library( magrittr )
doc <- read_xml( "./biosample_result.xml")
#gret all BioSample nodes
BioSample.Nodes <- xml_find_all( doc, "//BioSample")
#build a data.frame
data.frame(
sample_name = xml_find_first( BioSample.Nodes , ".//Id[#db='SRA']") %>% xml_text(),
stringsAsFactors = FALSE )
# sample_name
# 1 ERS2609990
# 2 ERS2609989
# 3 ERS2609988
So if you can use the XML, you will just have to use the right xpath-syntax to get the data/nodes you need, into the columns you want...
In the exmaple above, I extracted (from each BioSample-node) the first ID-node with attribute db equals SRA, and stored the result in the co0lumn sample_name.
Still assuming you can use the xml-data.
If you are lokking for all attributes into one df, you need the functions from purrr, so just load the entire tidyverse
library( tidyverse )
df <- xml_find_all( doc, "//BioSample") %>%
map_df(~{
set_names(
xml_find_all(.x, ".//Attribute") %>% xml_text(),
xml_find_all(.x, ".//Attribute") %>% xml_attr( "attribute_name" )
) %>%
as.list() %>%
flatten_df()
})
will result in a df like this
I am trying to build a data frame with book id, title, author, rating, collection, start and finish date from the LibraryThing api with my personal data. I am able to get a nested list fairly easily, and I have figured out how to build a data frame with everything but the dates (perhaps in not the best way but it works). My issue is with the dates.
The list I'm working with normally has 20 elements, but it adds the startfinishdates element only if I added dates to the book in my account. This is causing two issues:
If it was always there, I could extract it like everything else and it would have NA most of the time, and I could use cbind to get it lined up correctly with the other information
When I extract it using the name, and get an object with less elements, I don't have a way to join it back to everything else (it doesn't have the book id)
Ultimately, I want to build this data frame and an answer that tells me how to pull out the book id and associate it with each startfinishdate so I can join on book id is acceptable. I would just add that to the code I have.
I'm also open to learning a better approach from the jump and re-designing the entire thing as I have not worked with lists much in R and what I put together was after much trial and error. I do want to use R though, as ultimately I am going to use this to create an R Markdown page for my web site (for instance, a plot that shows finish dates of books).
You can run the code below and get the data (no api key required).
library(jsonlite)
library(tidyverse)
library(assertr)
data<-fromJSON("http://www.librarything.com/api_getdata.php?userid=cau83&key=392812157&max=450&showCollections=1&responseType=json&showDates=1")
books.lst<-data$books
#create df from json
create.df<-function(item){
df<-map_df(.x=books.lst,~.x[[item]])
df2 <- t(df)
return(df2)
}
ids<-create.df(1)
titles<-create.df(2)
ratings<-create.df(12)
authors<-create.df(4)
#need to get the book id when i build the date df's
startdates.df<-map_df(.x=books.lst,~.x$startfinishdates) %>% select(started_stamp,started_date)
finishdates.df<-map_df(.x=books.lst,~.x$startfinishdates) %>% select(finished_stamp,finished_date)
collections.df<-map_df(.x=books.lst,~.x$collections)
#from assertr: will create a vector of same length as df with all values concatenated
collections.v<-col_concat(collections.df, sep = ", ")
#assemble df
books.df<-as.data.frame(cbind(ids,titles,authors,ratings,collections.v))
names(books.df)<-c("ID","Title","Author","Rating","Collections")
books.df<-books.df %>% mutate(ID=as.character(ID),Title=as.character(Title),Author=as.character(Author),
Rating=as.character(Rating),Collections=as.character(Collections))
This approach is outside the tidyverse meta-package. Using base-R you can make it work using the following code.
Map will apply the user defined function to each element of data$books which is provided in the argument and extract the required fields for your data.frame. Reduce will take all the individual dataframes and merge them (or reduce) to a single data.frame booksdf.
library(jsonlite)
data<-fromJSON("http://www.librarything.com/api_getdata.php?userid=cau83&key=392812157&max=450&showCollections=1&responseType=json&showDates=1")
booksdf=Reduce(function(x,y){rbind(x,y)},
Map(function(x){
lenofelements = length(x)
if(lenofelements>20){
if(!is.null(x$startfinishdates$started_date)){
started_date = x$startfinishdates$started_date
}else{
started_date=NA
}
if(!is.null(x$startfinishdates$started_stamp)){
started_stamp = x$startfinishdates$started_date
}else{
started_stamp=NA
}
if(!is.null(x$startfinishdates$finished_date)){
finished_date = x$startfinishdates$finished_date
}else{
finished_date=NA
}
if(!is.null(x$startfinishdates$finished_stamp)){
finished_stamp = x$startfinishdates$finished_stamp
}else{
finished_stamp=NA
}
}else{
started_stamp = NA
started_date = NA
finished_stamp = NA
finished_date = NA
}
book_id = x$book_id
title = x$title
author = x$author_fl
rating = x$rating
collections = paste(unlist(x$collections),collapse = ",")
return(data.frame(ID=book_id,Title=title,Author=author,Rating=rating,
Collections=collections,Started_date=started_date,Started_stamp=started_stamp,
Finished_date=finished_date,Finished_stamp=finished_stamp))
},data$books))
#Report Section
output<-"D:/R/Reference program for R/Table_EG_chg.doc" # although this is RTF, we can use the
rtf<-RTF(output,width=8.5,height=11,font.size=9,omi=c(0.5,0.5,0.5,0.5))
addHeader(rtf,title = " Table14.3.2.3.1", subtitle =" Vital Signs - Absolute Values", font.size=9,TOC.level=0)
addTable(rtf,final,font.size=9,row.names=FALSE,NA.string="0",col.justify='L',header.col.justify='L',col.widths=c(1.75,1.5,1.25,0.5,0.5,0.5,0.5,0.5,0.5))
addTable(rtf,as.data.frame(head(iris)),font.size=10,row.names=FALSE,NA.string="-")
addText(rtf, "\n\n", bold=TRUE, italic=FALSE)
done(rtf) # writes and closes the file
final is my data frame which i need to print in the RTF output.
This is the code i have used to create the output in RTF. It works fines for first page alone, for the rest of the page output doesn't have Title and footnotes in all the pages. Please can anyone has done the this method if so please can you send the code...
This is easily done in SAS. I need it in R.
Any one has answer for this.....
Think you are asking the listings where we can do in SAS programming, I have tried using R program and got the outputs. Please find the below code I have used a dummy dataset and applied logics which you need to get the rtf document where we can see titles and footnotes in multiple pages.
library(rtf)
final <- data.frame(Subject = c(1001,1002,1003,1004,1005,1006), Country = c("USA","IND","CHN","JPN","SA","EUR"),
Age = c(50,60,51,63,73,65), Sex = c("M","F","M","F","M","F"), SBP = c(120,121,119,123,126,128),
DBP = c(80,70,75,85,89,71))
final$seq <- rep(seq(1,nrow(final),2),each =2)
rtf<-RTF("Table_EG_chg.rtf",width=11,height=5,font.size=9,omi=c(0.5,0.5,0.5,0.5))
for ( i in unique(final$seq)){
new <- final [final$seq == i , ]
new$seq <- NULL
name.width <- max(sapply(names(new), nchar))
new <- format(new, justify = "centre")
addHeader(rtf,title = "\t\t\t\t\t\t\t\t\tTable14.3.2.3.1", subtitle ="\t\t\t\t\t\t\t\t\tVital Signs - Absolute Values", font.size=9)
addTable(rtf,new,font.size=9,row.names=FALSE,NA.string="0",col.justify='L',header.col.justify='L',col.widths=c(1.75,1.5,1.25,1.5,1.5,1.5))
startParagraph.RTF(rtf)
addText.RTF(rtf,paste("\n","- Vital signs lab values are collected at the day of ICF.\n"))
addText.RTF(rtf,"- Vital signs SBP - systolic blood pressure; DBP - Diastolic blood pressure")
endParagraph.RTF(rtf)
addPageBreak(rtf, width=11,height=5,font.size=9,omi=rep(0.5,0.5,0.5,0.5))
}
done(rtf)
#Jaikumar Sorry it took 6 years for a package to come out that can finally do what you want. At the end of last year, the reporter package was released. This package replicates a lot of the functionality of SAS proc report. It can do dataset listings, just like SAS. It will repeat titles and footnotes on every page, without having to do anything special. Here is an example:
library(reporter)
library(magrittr)
# Create table
tbl <- create_table(iris) %>%
titles("Sample Title for Iris Data") %>%
footnotes("My footnote")
# Create report and add table to report
rpt <- create_report("test.rtf", output_type = "RTF") %>%
add_content(tbl)
# Write the report
write_report(rpt)
It can also print in RTF, PDF, and TXT. To use PDF, just change the file name and the output_type.