Column of data not appearing after web scrape (rvest and xlsx)

I am trying to download public Treasury data, but my scrape only pulls the date column, the 20-year column, and the extrapolation factor. The 10-year column, which sits in the middle of the table, is missing from both the scrape and the Excel output. My code is below (working directory not shown).
url <- "https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=longtermrateYear&year=2020"
ten_year_comp <- read_html(url, encoding = "table")
ten_year_comp %>%
html_nodes("table") %>%
.[[4]] %>%
html_table(fill = TRUE) %>%
write.xlsx(ten_year_comp, file = "TREASURY10YR.xlsx", sheetName = "ten_year_comp",
col.names = TRUE, row.names = TRUE, asTable = TRUE, append = FALSE)

Store the scraped table in ten_year_comp first, then pass that data frame to write.xlsx:
library(rvest)
url <- "https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=longtermrateYear&year=2020"
ten_year_comp <- html_nodes(read_html(url), "table")[[4]] %>% html_table(fill = TRUE)
write.xlsx(
ten_year_comp,
file = "TREASURY10YR.xlsx",
sheetName = "ten_year_comp",
col.names = TRUE,
row.names = TRUE,
asTable = TRUE,
append = FALSE
)
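When a column seems to be missing, it can help to check what each scraped table actually contains before picking one by position. The sketch below does that on a small inline HTML page. The page content and the column names date, ten_year and twenty_year are made up for illustration and are not the real layout of the Treasury page.

```r
library(rvest)

# Hypothetical stand-in for a downloaded page: a layout table plus a data table.
page <- read_html('
<html><body>
  <table><tr><th>menu</th></tr><tr><td>home</td></tr></table>
  <table>
    <tr><th>date</th><th>ten_year</th><th>twenty_year</th></tr>
    <tr><td>d1</td><td>1.0</td><td>2.0</td></tr>
    <tr><td>d2</td><td>1.5</td><td>2.5</td></tr>
  </table>
</body></html>')

# Parse every <table> on the page into a list of data frames.
tables <- page %>% html_nodes("table") %>% html_table()

# Inspect the column names of each table to see where the wanted column lives.
col_names <- lapply(tables, names)
print(col_names)

# Select the table by content instead of a hard-coded index such as [[4]].
idx <- which(vapply(col_names, function(n) "ten_year" %in% n, logical(1)))
ten_year_comp <- tables[[idx]]
print(ten_year_comp)
```

Selecting by column name is only one option. If the real page uses different headers, the same check against names() shows which index to use.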

Related

data.table doubles in size after being written

I am working with the SALES dataset and a trimmed copy of it (TEST). The problem is that the SALES file doubles in size when it is saved.
This happens only with SALES; when the same procedure is applied to TEST, the output is the same size as the original file.
I tried converting the data to a base R data frame, and the result is the same.
Similarly, if I open SALES_2 and save it again, its size doubles again.
This is the current code:
library(jsonlite)
library(lubridate)
library(tidyverse)
library(readr)
library(stringi)
library(stringr)
library(readxl)
options(scipen = 999)
SALES <- read_delim("C:/Users/edjca/OneDrive/FORMA/PRUEBA/SALES.csv",
delim = "|", escape_double = FALSE, locale =
locale(decimal_mark = ",", grouping_mark ="."), trim_ws = TRUE)
TEST <- read_delim("C:/Users/edjca/OneDrive/FORMA/PRUEBA/Test.csv",
delim = "|", escape_double = FALSE,
locale = locale(decimal_mark = ",", grouping_mark = "."), trim_ws = TRUE)
data.table::fwrite(SALES, "C:/Users/edjca/OneDrive/FORMA/PRUEBA/SALES_2.csv", sep = "|", dec = ",")
data.table::fwrite(TEST, "C:/Users/edjca/OneDrive/FORMA/PRUEBA/Test_2.csv", sep = "|", dec = ",")
(Picture of the file sizes in my folder and the object sizes in R.)

How to read multiple PDF files in R?

I have a script that I am using to read multiple PDF files. Here is my code:
corpus_raw <- data.frame("company" = c(),"text" = c(), check.names = FALSE)
for (i in 1:length(pdf_list)){
print(i)
document_text <- pdf_text(paste("V:/CodingProject2_FundOverview/", pdf_list[i],sep = "")) %>%
strsplit("\r\n")
document <- data.frame("company" = gsub(x = pdf_list[i],pattern = ".pdf", replacement = ""),
"text" = document_text, stringsAsFactors = FALSE, check.names = FALSE)
colnames(document) <- c("company", "text")
corpus_raw <- rbind(corpus_raw,document)
}
I get the following error message:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 79, 56
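I even tried keeping check.names = FALSE, but it seems I am doing something wrong. Any help will be appreciated. Thanks.
I was able to figure out the answer on my own. strsplit returns a list with one character vector per page, and data.frame tries to turn each of those vectors into its own column, which fails when the pages have different numbers of lines. Wrapping the list in I() keeps it as a single list column: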
for (i in 1:length(pdf_list)){
print(i)
document_text <- pdf_text(paste("V:/CodingProject2_FundOverview/", pdf_list[i],sep = "")) %>%
strsplit("\r\n")
document <- data.frame("company" = gsub(x = pdf_list[i],pattern = ".pdf", replacement = ""),
"text" = I(document_text), stringsAsFactors = FALSE, check.names = FALSE)
colnames(document) <- c("company", "text")
corpus_raw <- rbind(corpus_raw,document)
}
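The effect of I() can be reproduced without any PDF files. This is a minimal base R sketch in which a hand-made list stands in for the output of strsplit(pdf_text(...)); the company name fund_a and the text lines are placeholders.

```r
# Stand-in for strsplit(pdf_text(...), "\r\n"): a list with one character
# vector per page, where the pages have different numbers of lines.
document_text <- list(c("line 1", "line 2", "line 3"), c("line 1", "line 2"))

# Without I(), data.frame() treats each list element as a separate column
# and fails because the columns have different lengths.
err <- tryCatch(
  data.frame(company = "fund_a", text = document_text, stringsAsFactors = FALSE),
  error = function(e) conditionMessage(e)
)
print(err)

# With I(), the list is kept as one list column: one row per page.
document <- data.frame(company = "fund_a", text = I(document_text),
                       stringsAsFactors = FALSE)
print(nrow(document))
print(lengths(document$text))
```

Each row of document then holds the lines of one page, and the company value is recycled across the rows.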

Import big CSV files at once in R

I have 70 CSV files with the same columns in a folder; each is 0.5 GB.
I want to import them into a single data frame in R.
Normally I import each of them individually, as below, and that works:
df <- read_delim("file.csv",
"|", escape_double = FALSE, col_types = cols(pc_no = col_character(),
id_key = col_character()), trim_ws = TRUE)
To import all of them at once, I tried the code below, but it fails with the error: argument "delim" is missing, with no default
tbl <-
list.files(pattern = "*.csv") %>%
map_df(~read_delim("|", escape_double = FALSE, col_types = cols(pc_no = col_character(), id_key = col_character()), trim_ws = TRUE))
With read_csv the import runs, but the result has only one column, which contains all the columns and values.
tbl <-
list.files(pattern = "*.csv") %>%
map_df(~read_csv(., col_types = cols(.default = "c")))
In your second block of code, you're missing the ., so read_delim is interpreting your arguments as read_delim(file="|", delim=<nothing provided>, ...). Try:
tbl <- list.files(pattern = "*.csv") %>%
map_df(~ read_delim(., delim = "|", escape_double = FALSE,
col_types = cols(pc_no = col_character(), id_key = col_character()),
trim_ws = TRUE))
I explicitly identified delim= here but it's not strictly necessary. Had you done that in your first attempt, however, you would have seen
readr::read_delim(delim = "|", escape_double = FALSE,
col_types = cols(pc_no = col_character(), id_key = col_character()),
trim_ws = TRUE)
# Error in read_delimited(file, tokenizer, col_names = col_names, col_types = col_types, :
# argument "file" is missing, with no default
which is more indicative of the actual problem.
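The pattern can be tried end to end on throwaway data. The sketch below writes two tiny pipe-delimited files to a temporary folder and reads them into one data frame; the file names, the amount column and the values are invented for the demo, and map_df additionally needs dplyr to be installed. It also uses a regular expression for the file pattern, since list.files expects a regex and not a glob such as "*.csv".

```r
library(readr)
library(purrr)

# Create two tiny pipe-delimited files in a temporary folder (made-up data).
dir <- tempfile("csv_demo")
dir.create(dir)
writeLines(c("pc_no|id_key|amount", "001|A1|10"), file.path(dir, "part1.csv"))
writeLines(c("pc_no|id_key|amount", "002|B2|20"), file.path(dir, "part2.csv"))

# list.files() takes a regular expression, so "\\.csv$" matches the extension.
files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# .x is the current file path; without it read_delim() has no file argument.
tbl <- map_df(files, ~ read_delim(
  .x,
  delim = "|",
  col_types = cols(pc_no = col_character(), id_key = col_character()),
  trim_ws = TRUE
))
print(tbl)
```

Reading pc_no as character keeps leading zeros, which is presumably why the original call sets col_types for it.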

time series analysis of text in r

If I have some data like so:
df = data.frame(person = c('jim','john','pam','jim'),
date =c('2018-01-01','2018-02-01','2018-03-01','2018-04-01'),
text = c('the lonely engineer','tax season is upon us, engineers, do your taxes!','i am so lonely','rage coding is the best') )
and I wanted to understand trending terms by date, how can I go about that?
xCorp = corpus(df, text_field = 'text')
x = tokens(xCorp) %>% tokens_remove(
c(
stopwords('english'),
'western digital',
'wd',
'nil'),
padding = T
) %>%
dfm(
remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = T,
concatenator = ' '
)
x2 = dfm(x, groups = 'date')
This would get me part of the way there, but I am not sure it's the best way.
Using the tidyverse and tidytext, I was able to do the following:
library(dplyr)
library(tidytext)
df = df %>%
group_by(date) %>%
unnest_tokens(word, text) %>%
count(word, sort = TRUE)
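For reference, here is a self-contained sketch of counting terms per date with tidytext, using the sample data from the question. Dropping stop words with the stop_words data set that ships with tidytext is an added assumption about what counts as a useful term. No stemming is applied, so "engineer" and "engineers" are counted separately.

```r
library(dplyr)
library(tidytext)

# Sample data from the question; text is kept as character, not factor.
df <- data.frame(
  person = c("jim", "john", "pam", "jim"),
  date = c("2018-01-01", "2018-02-01", "2018-03-01", "2018-04-01"),
  text = c("the lonely engineer",
           "tax season is upon us, engineers, do your taxes!",
           "i am so lonely",
           "rage coding is the best"),
  stringsAsFactors = FALSE
)

term_counts <- df %>%
  unnest_tokens(word, text) %>%          # one row per lower-cased word
  anti_join(stop_words, by = "word") %>% # drop common English stop words
  count(date, word, sort = TRUE)         # occurrences of each word per date
print(term_counts)
```

With real data, converting date with as.Date and counting by month or week would give a coarser trend.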

How to include hyperlinks in flextable?

I am trying to make a PowerPoint file with officer that includes tables with hyperlinks in their cells, but I can't find a way to do it.
E.g. a row in a 2 column table on a slide could include 'Ensembl' in the 1st column, the 2nd column would say 'ENSG00000165025' and clicking on it would open up the browser at 'uswest.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000165025'. Some values in the 2nd column could be just plain text.
Is this possible to achieve?
With the new version on GitHub, you will be able to include hyperlinks, as in the demo below:
library(flextable)
dat <- data.frame(
col = "CRAN website", href = "https://cran.r-project.org",
stringsAsFactors = FALSE)
ft <- flextable(dat)
ft <- display(
ft, col_key = "col", pattern = "# {{mylink}}",
formatters = list(mylink ~ hyperlink_text(href, col) )
)
ft
The demo works, but with a URL that contains & the table fails when it is added to a slide:
dat <- data.frame(
col = "entrez",
href = "https://www.ncbi.nlm.nih.gov/gene?cmd=Retrieve&dopt=full_report&list_uids=6850",
stringsAsFactors = FALSE)
ft <- flextable(dat)
ft <- display(
ft, col_key = "col", pattern = "# {{mylink}}",
formatters = list(mylink ~ hyperlink_text(href, col) )
)
ft # works fine
doc <- read_pptx() %>%
add_slide(layout = 'Title and Content', 'Office Theme') %>%
ph_with_flextable(ft) # error
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url,
as_html = as_html, : EntityRef: expecting ';' [23]
Repeating with the URL percent-encoded via URLencode(..., reserved = TRUE):
dat <- data.frame(
col = "entrez", href = URLencode("https://www.ncbi.nlm.nih.gov/gene?cmd=Retrieve&dopt=full_report&list_uids=6850", reserved = TRUE),
stringsAsFactors = FALSE)
ft <- flextable(dat)
ft <- display(
ft, col_key = "col", pattern = "# {{mylink}}",
formatters = list(mylink ~ hyperlink_text(href, col) )
)
ft # clicking the link in RStudio fails
doc <- read_pptx() %>%
add_slide(layout = 'Title and Content', 'Office Theme') %>%
ph_with_flextable(ft) # no error here, but an error message appears when opening the PowerPoint file
