Blank lines formatting with the officer package in R

I am currently using the officer package to produce a Word document. I have been using body_add_par to add multiple chunks of blank lines throughout the document, but this method is becoming tedious.
Is there a way to create a function, or somehow be able to write one line of code that is able to specify how many blank lines I want to insert?
Practice_R.docx = read_docx() %>%
  body_add_par("") %>%
  body_add_par("") %>%
  body_add_par("") %>%
  body_add_par("test") %>%
  body_add_par("") %>%
  body_add_par("") %>%
  body_add_par("") %>%
  body_add_par("") %>%
  body_add_par("") %>%
  body_add_par("") %>%
  body_add_par("test2") %>%
  body_add_break(pos = "after")

Try:
# repeat the addition of an empty paragraph n times
body_add_par_n <- function(doc, n) {
  i <- 1                          # initialize counter
  while (i <= n) {                # loop while the counter is at most n
    doc <- body_add_par(doc, "")  # add an empty paragraph to the object
    i <- i + 1                    # increment the counter
  }
  doc                             # return the object
}
Practice_R.docx = read_docx() %>%
  body_add_par_n(3) %>%
  body_add_par("test") %>%
  body_add_par_n(6) %>%
  body_add_par("test2") %>%
  body_add_break(pos = "after")
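If you prefer to avoid an explicit counter, a more compact sketch of the same idea folds the empty-paragraph step with base R's Reduce():

# equivalent to the while loop: apply body_add_par(doc, "") n times
body_add_par_n <- function(doc, n) {
  Reduce(function(d, i) body_add_par(d, ""), seq_len(n), init = doc)
}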

Related

How to force linebreaks in kableExtra functions with escape = FALSE?

In kableExtra >= 0.8.0, the canonical way to insert a linebreak into text piped into a table from a kableExtra function such as add_header_above or pack_rows is to add an \n directly.
However, this appears not to work with the escape = FALSE argument, which is required if the text also contains LaTeX code.
How can one force linebreaks in kableExtra functions with escape = FALSE?
library(dplyr)
library(knitr)
library(kableExtra)

starwars %>%
  filter(species == 'Gungan' | species == 'Droid') %>%
  arrange(species) %>%
  select(name, eye_color) %>%
  kbl(booktabs = TRUE) %>%
  pack_rows(
    index = c(
      'The droids: everybody\'s favourite' = 6,
      'The Gungans: only beloved of \nthose aged under $3^2$' = 3),
    escape = FALSE)
ISSUE
The issue at hand is that you wish to escape part of your header (i.e., the break) and not escape another part (i.e., the math code).
Further Complications
This core issue is further complicated by a number of factors:
- when and how kableExtra is programmed to deal with escaping
- the desire for a solution that works for both HTML and LaTeX output
- when and how R evaluates code
A SOLUTION
Here is a solution that will work for both HTML and LaTeX output, but it is not as clean and straightforward as your original code:
# a new version of `kableExtra::linebreak()` that takes into account what type
# of output is desired as well as how much escaping is necessary
linebreak2 <- function(x, double_escape = TRUE, ...) {
  # if LaTeX, insert text into a `\makecell[]{}` command and double escape
  if (knitr::is_latex_output())
    return(linebreak(x, double_escape = double_escape, ...))
  # if html output, just replace `\n`s with `<br/>`s
  if (knitr::is_html_output())
    return(gsub("\n", "<br/>", x))
  # let x pass through for other types of output
  return(x)
}
# build the index named vector outside the pipe flow
# in order to set the names using `linebreak2()`
index <- c(6, 3)
names(index) <- c(
  'The droids: everybody\'s favourite',
  linebreak2('The Gungans: only beloved of \nthose aged under $3^2$')
)

# proceed as before
starwars %>%
  filter(species == 'Gungan' | species == 'Droid') %>%
  arrange(species) %>%
  select(name, eye_color) %>%
  kbl(booktabs = TRUE) %>%
  pack_rows(index = index, escape = FALSE)
(Screenshots of the PDF and HTML output omitted.)
You could use the HTML line break tag <br/>:
starwars %>%
  filter(species == 'Gungan' | species == 'Droid') %>%
  arrange(species) %>%
  select(name, eye_color) %>%
  kbl(booktabs = TRUE) %>%
  pack_rows(
    index = c(
      'The droids: everybody\'s favourite' = 6,
      'The Gungans: only beloved of <br/> those aged under $3^2$' = 3),
    escape = FALSE)
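Note that <br/> only renders in HTML output; for PDF you would still need a makecell-based approach such as linebreak2() above. A minimal sketch of guarding the label by output format (the gungan_label name is just illustrative):

# assumption: fall back to a plain single-line label for non-HTML output
gungan_label <- if (knitr::is_html_output()) {
  'The Gungans: only beloved of <br/> those aged under $3^2$'
} else {
  'The Gungans: only beloved of those aged under $3^2$'
}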

How to save multiple matches in one column? rvest, R and stringr

This question is a follow-up to a previous Stack Overflow question.
I have two example HTML files: url1.html and url2.html. url3.html is another example with more IPC codes.
url2.html contains no (51) information, while url1.html does.
I'm using this code in R:
library(rvest)
library(tidyverse)
library(stringr)
x<-data.frame(
URL=c(1:2),
page=c(paste(readLines("url1.html"), collapse="\n"),
paste(readLines("url2.html"), collapse="\n"))
)
for (i in 1:nrow(x)){
html<-x$page[i]%>% unclass() %>% unlist()
read_html(html,encoding = "ISO-8859-1") %>%
rvest::html_elements(xpath = '//*[#id="principal"]/table[2]') %>%
html_nodes(xpath='//div[#id="classificacao0"]') %>%
html_text(trim=T)%>%
str_replace_all(.,"[\\n\\r\\t]+", "")%>%
stringr::str_trim( ) -> tmp
if(length(tmp) == 0) tmp <- "ND"
x$ipc_0[i] <- tmp %>% str_replace_all(.,"\\s+", " ") %>% str_replace_all(.," \\)", "\\)")
}
for (i in 1:nrow(htm_temp)){
html<-x$page[i]%>% unclass() %>% unlist()
read_html(html,encoding = "ISO-8859-1") %>%
rvest::html_elements(xpath = '//*[#id="principal"]/table[2]') %>%
html_nodes(xpath='//div[#id="classificacao1"]') %>%
html_text(trim=T)%>%
str_replace_all(.,"[\\n\\r\\t]+", "")%>%
stringr::str_trim( ) -> tmp
if(length(tmp) == 0) tmp <- "ND"
x$ipc_1[i] <- tmp %>% str_replace_all(.,"\\s+", " ") %>% str_replace_all(.," \\)", "\\)")
}
Result: partially correct
Desired result: create a new data frame with the following structure.

URL  IPC
1    B62B 1/16 (1968.09)...
1    B62B 1/00 (1968.09)...
2    ND
Problem: some URLs have the code (51) and others do not. When a page has the code (51), it can contain n divs with ids of the form //div[@id="classificacao0"], where the classificacao id can take values from 0 to n. How can I optimize this code to capture the necessary information without writing a separate for loop for each n?
Any idea how to solve this problem?
You can use a CSS attribute = value selector list with the ^ (starts with) operator to capture or exclude elements by their id values.
Convert your current extraction code into a function which accepts (in this case) a URL as its argument. Extend the regex to remove the other characters not shown in your desired output.
Have that function return a tibble of the URL and the IPCs found; wrap the whole thing in a map_dfr() call to generate a single data frame result.
library(rvest)
library(tidyverse)

urls <- sprintf("https://prequest.websiteseguro.com/example/url%i.html", 1:3)

get_ipc <- function(url) {
  ipc <- read_html(url, encoding = "ISO-8859-1") %>%
    html_elements("div[id^=classificacao]:not([id^=classificacaoc]) .normal > b") %>%
    html_text(trim = T) %>%
    str_replace_all(., "[\\n\\r\\t]+|\\(|\\s{2,}|\\)", "") %>%
    stringr::str_trim()
  if (length(ipc) == 0) ipc <- "ND"
  return(tibble(url = url, ipc))
}

df <- purrr::map_dfr(urls, get_ipc)
print(df)
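If you want the numeric URL index from the desired output instead of the full address, a small sketch building on the same get_ipc():

# number the pages 1..n to match the desired URL column
df <- purrr::map_dfr(seq_along(urls), function(i) {
  get_ipc(urls[i]) %>% mutate(URL = i) %>% select(URL, ipc)
})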

Web scraping across multiple pages in R

I have been working on some R code. The purpose is to collect the average word length and other stats about the words in a section of a website with 50 pages. Collecting the stats is no problem; that is the easy part. The hard part is getting my code to collect the stats over all 50 pages: it only ever seems to output information from the first page. See the code below.
install.packages(c('tidytext', 'tidyverse'))
library(tidyverse)
library(tidytext)
library(rvest)
library(stringr)

websitePage <- read_html('http://books.toscrape.com/catalogue/page-1.html')
textSort <- websitePage %>%
  html_nodes('.product_pod a') %>%
  html_text()

for (page_result in seq(from = 1, to = 50, by = 1)) {
  link = paste0('http://books.toscrape.com/catalogue/page-', page_result, '.html')
  page = read_html(link)
  # Creates a tibble
  textSort.tbl <- tibble(text = textSort)
  textSort.tidy <- textSort.tbl %>%
    unnest_tokens(word, text)
}
# Finds the average word length
textSort.tidy %>%
  map(nchar) %>%
  map(mean)

# Finds the most common words
textSort.tidy %>%
  count(word, sort = TRUE)

# Removes the stop words and then finds most common words
textSort.tidy %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)

# Counts the number of times the word "Girl" is in the text
textSort.tidy %>%
  count(word) %>%
  filter(word == "Girl")
You can use lapply/map to extract the text from multiple links.
library(rvest)

link <- paste0('http://books.toscrape.com/catalogue/page-', 1:50, '.html')
result <- lapply(link, function(x) x %>%
                   read_html %>%
                   html_nodes('.product_pod a') %>%
                   html_text)
You can continue using lapply if you want to apply other functions to text.
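For example, a sketch of how the per-page results could be flattened into one tidy tibble so that the stats from the question run over all 50 pages at once (names follow the question's code):

library(tidyverse)
library(tidytext)

# stack the 50 character vectors into a single column and tokenize
textSort.tidy <- tibble(text = unlist(result)) %>%
  unnest_tokens(word, text)

# average word length across all pages
textSort.tidy %>%
  summarise(avg_length = mean(nchar(word)))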

How to properly parse (?) mdsets in expss within a loop?

I'm new to R and I don't know all the basic concepts yet. The task is to produce one merged table with multiple response sets. I am trying to do this using the expss library and a loop.
This is the code in R without a loop (works fine):
#libraries
#blah, blah...
# path
df.path = "C:/dataset.sav"
# dataset load
df = read_sav(df.path)

# table
table_undropped1 = df %>%
  tab_cells(mdset(q20s1i1 %to% q20s1i8)) %>%
  tab_total_row_position("none") %>%
  tab_stat_cpct() %>%
  tab_pivot()
There are 10 multiple response sets, therefore I need to create 10 tables in the manner shown above. Then I transpose those tables and merge them. To simplify the code (and learn something new) I decided to produce the tables using a loop. However, nothing works. I have looked for a solution, and I think the closest to correct is:
# this generates a message: '1' not found
for (i in 1:10) {
  assign(paste0("table_undropped", i), 1) = df %>%
    tab_cells(mdset(assign(paste0("q20s", i, "i1"), 1) %to% assign(paste0("q20s", i, "i8"), 1)))
    tab_total_row_position("none") %>%
    tab_stat_cpct() %>%
    tab_pivot()
}
It still causes the error described in the comment above the code.
For comparison, an SPSS macro for this would be (included only to better express the problem, since I have to avoid SPSS):
define macro1 (x = !tokens (1)
/y = !tokens (1))
!do !i = !x !to !y.
mrsets
/mdgroup name = !concat($SET_,!i)
variables = !concat("q20s",!i,"i1") to !concat("q20s",!i,"i8")
value = 1.
ctables
/table !concat($SET_,!i) [colpct.responses.count pct40.0].
!doend
!enddefine.
*** MACRO CALL.
macro1 x = 1 y = 10.
In other words, I am looking for a working substitute for !concat() in R.
%to% is not suited for parametric variable selection. There is a set of special functions for parametric variable selection and assignment. One of them is mdset_t:
for (i in 1:10) {
  table_name = paste0("table_undropped", i)
  ..$table_name = df %>%
    tab_cells(mdset_t("q20s{i}i{1:8}")) %>% # expressions in the curly brackets will be evaluated and substituted
    tab_total_row_position("none") %>%
    tab_stat_cpct() %>%
    tab_pivot()
}
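With i = 2, for instance, the template "q20s{i}i{1:8}" expands to the variables q20s2i1 through q20s2i8 before the multiple response set is built.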
However, it is not good practice to store all tables as separate variables in the global environment. A better approach is to save all the tables in a list:
all_tables = lapply(1:10, function(i)
  df %>%
    tab_cells(mdset_t("q20s{i}i{1:8}")) %>%
    tab_total_row_position("none") %>%
    tab_stat_cpct() %>%
    tab_pivot()
)
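If you still want the table_undropped1 … table_undropped10 labels, you can name the list entries afterwards:

names(all_tables) <- paste0("table_undropped", 1:10)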
UPDATE.
Generally speaking, there is no need to merge. You can do all your work with tab_*:
my_big_table = df %>%
  tab_total_row_position("none")

for (i in 1:10) {
  my_big_table = my_big_table %>%
    tab_cells(mdset_t("q20s{i}i{1:8}")) %>% # expressions in the curly brackets will be evaluated and substituted
    tab_stat_cpct()
}

my_big_table = my_big_table %>%
  tab_pivot(stat_position = "inside_columns") # here we say that we need to combine subtables horizontally

Loop over list in R, conduct analysis specific to element in list, save results in element dataframe?

I am trying to replicate an analysis using tidytext in R, except using a loop. The specific example comes from Julia Silge and David Robinson's Text Mining with R, a Tidy Approach. The context for it can be found here: https://www.tidytextmining.com/sentiment.html#sentiment-analysis-with-inner-join.
In the text, they give an example of how to do sentiment analysis using the NRC lexicon, which has eight different sentiments, including joy, anger, and anticipation. I'm not doing an analysis for a specific book like the example, so I commented out that line, and it still works:
nrc_list <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

wordcount_joy <- wordcount %>%
  # filter(book == "Emma") %>%
  inner_join(nrc_list) %>%
  count(word, sort = TRUE)
As I said before, this works. I now want to modify it to loop over all eight emotions and save the results in data frames labeled with the emotion. Here is how I tried to modify it:
emotion <- c('anger', 'disgust', 'joy', 'surprise', 'anticip', 'fear', 'sadness', 'trust')
for (i in emotion) {
  nrc_list <- get_sentiments("nrc") %>%
    filter(sentiment == "i")
  wcount[[i]] <- wordcount %>%
    inner_join(nrc_list) %>%
    count(word, sort = TRUE)
}
I get an "Error: object 'wcount' not found" message when I do this. I have googled this, and it seems like the answer is to use wcount[[i]], but clearly something is off in my adaptation. Do you have any suggestions?
The code below will do the trick. Note that you refer to wordcount in your loop while the example uses tidy_books. Two fixes matter here: wcount must be initialized before the loop, and the filter must compare sentiment == i without quotes, since "i" is the literal string rather than the loop variable. The code follows the steps in the tidytextmining link you are referring to.
library(janeaustenr)
library(dplyr)
library(stringr)
library(tidytext)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

emotion <- c('anger', 'disgust', 'joy', 'surprise', 'anticip', 'fear', 'sadness', 'trust')

# initialize list with the length of the emotion vector
wcount <- vector("list", length(emotion))
# name the list entries
names(wcount) <- emotion

# run loop
for (i in emotion) {
  nrc_list <- get_sentiments("nrc") %>%
    filter(sentiment == i)
  wcount[[i]] <- tidy_books %>%
    inner_join(nrc_list) %>%
    count(word, sort = TRUE)
}
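If you would rather end up with a single data frame than a named list, one option is to stack the list with dplyr's bind_rows(), which can store each list name in a column:

# combine the per-emotion results; .id keeps the emotion as a column
wcount_df <- bind_rows(wcount, .id = "sentiment")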
