Reading a docx file with R and converting the values to numeric

I have this docx file ( https://github.com/rhozon/datasets/blob/master/idades.docx?raw=true ) and I want to read it directly from the link and convert the numbers to numeric.
I'm using textreadr::read_docx(..., but I believe there is a better way to do this.

A possible solution:
library(tidyverse)
library(textreadr)
read_docx("idades.docx") %>%
str_split(",") %>%
unlist() %>%
str_trim() %>%
as.numeric()
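If you want to read the file straight from the link rather than from a local copy, one option (just a sketch, not from the original answer) is to download it to a temporary file first, since read_docx() is typically given a local path:
library(tidyverse)
library(textreadr)

# download the .docx from the raw GitHub link to a temporary file,
# then read it with read_docx() as before
tmp <- tempfile(fileext = ".docx")
download.file("https://github.com/rhozon/datasets/blob/master/idades.docx?raw=true",
              tmp, mode = "wb")

read_docx(tmp) %>%
  str_split(",") %>%
  unlist() %>%
  str_trim() %>%
  as.numeric()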

Related

How to save the summarise data into a dataframe

I'm currently using the group_by() function and summarise() to get the sum of my columns. Is it possible to save that information to another data frame somehow? Maybe even create a csv file with its information. Thanks
workday %>%
  group_by(Date) %>%
  mutate_if(is.character, as.numeric) %>%
  summarise(across(Axis1:New_Sitting, sum))
Store the pipe result in a new object, say a
a <- workday %>% group_by(Date) %>% #other ops on workday
To save it to a file there are several options, including the base write.csv:
write.csv(a, "Path to file")
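Putting the two steps together, a minimal sketch (the column names come from the question; the output file name is only a placeholder):
library(dplyr)

# store the summarised result in a new object
workday_summary <- workday %>%
  group_by(Date) %>%
  mutate_if(is.character, as.numeric) %>%
  summarise(across(Axis1:New_Sitting, sum))

# write it out to CSV (placeholder file name)
write.csv(workday_summary, "workday_summary.csv", row.names = FALSE)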

Passing String to dplyr as a Function?

I am interested in the ability to pass a string not as an argument within a function but as an entire function call. This may not be the smartest approach, but I am simply curious and want to understand how dplyr works and how R interprets strings. Perhaps I am missing something very obvious, but here are my attempts:
#what i want----
library(dplyr)
mtcars %>% count()
#replicate by passing string as count---
#feed string as a function
my_string = "count()"
#attempt 1
mtcars %>% my_string
#attempt 2
mtcars %>% eval(noquote(my_string))
#neither of the attempts work
If this is not possible I understand, but it would be interesting, as I can see some applications for this in my mind.
EDIT
A little more to explain why I want to do this. I have worked with fst files for some time for some very large data, loading it into my environment like so, often performing operations on one file at a time and in parallel, which is very efficient for my purposes:
#pseudo code---
seq.Date(1, 2, by = "days") %>%
  pblapply(function(x){
    read.fst(list.files(as.character(x)), as.data.table = T) %>%
      #this portion turns into a string----
      group_by(foo) %>%
      count()
      #------------------------------
  }) %>% rbindlist()
#application-------
my_string = "group_by(foo) %>%
count()"
seq.Date(1,2,by = "days") %>%
pblapply(function(x){
read.fst(list.files(as.character(x), as.data.table = T) %>% my_string
}) %>% rbindlist()
I use data.table more often, but I think dplyr might be better for this specific task. What I want to be able to do is write out the entire pipeline separately as a string and then pass it, which would let me write a small package to shorten my workflow. Something to that effect.
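One way to get this behaviour (not from the original post, just a sketch using base R's parse() and eval()) is to paste the data in front of the stored pipeline string, parse it, and evaluate the result:
library(dplyr)

# hypothetical helper: evaluate a dplyr pipeline stored as a string against a data frame
run_pipeline <- function(.data, pipeline_string) {
  code <- paste(".data %>%", pipeline_string)
  eval(parse(text = code), envir = environment())
}

my_string <- "group_by(cyl) %>% count()"
mtcars %>% run_pipeline(my_string)   # same result as mtcars %>% group_by(cyl) %>% count()
rlang::parse_expr() offers a tidier route to the same idea, but the string-pasting version above is enough to show the mechanics.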

How to extract tabular data from a website using R

I am trying to extract the data from the webpage
https://www.geojit.com/other-market/world-indices
and many others similar to this.
I need to get the tabular data of the website (INDEX, NAME, COUNTRY, CLOSE, PREV.CLOSE, NET CHANGE, CHANGE (%), LAST UPDATED DATE & TIME). It would be great if you could share the R code for this; any help would be welcome.
library(rvest)
library(dplyr)
google <- html("https://www.geojit.com/other-market/world-indices")
google %>%
html_nodes()
library(rvest)
my_tbl <- read_html("https://www.geojit.com/other-market/world-indices") %>%
  html_nodes(xpath = "//*[@id=\"aboutContent\"]/div[2]/table") %>%
  html_table(header = TRUE) %>%
  `[[`(1)
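If numeric columns such as CLOSE and PREV.CLOSE come through as character (thousands separators and the like), a short follow-up sketch could coerce them; note this is not part of the original answer, and the exact header names returned by html_table() may differ from those listed in the question:
library(dplyr)
library(readr)

# column names taken from the question; adjust to the headers html_table() actually returns
my_tbl %>%
  mutate(across(c(CLOSE, `PREV.CLOSE`), parse_number))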

R: Converting characters to numbers in an R data.frame

A question regarding this data extraction I did. I would like to create a bar chart with the data, but unfortunately I am unable to convert the extracted characters to numbers inside R. If I edit the file in a text editor, there's no problem at all, but I'd like to do the whole process in R. Here is the code:
install.packages("rvest")
library(rvest)
url <- "https://en.wikipedia.org/wiki/Corporate_tax"
corporatetax <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/div/table[5]') %>%
  html_table()
str(corporatetax)
As a result, corporatetax contains a data.frame with 3 variables, all of them character. My question, which I've not been able to resolve, is how I should proceed to convert the second and third columns to numbers so I can create a bar chart. I've tried with sapply() and dplyr but did not find a correct way to do that.
Thanks!
You might try to clean up the table like this
library(rvest)
library(stringr)
library(dplyr)
url <- "https://en.wikipedia.org/wiki/Corporate_tax"
corporatetax <- url %>%
  read_html() %>%
  # your xpath defines the single table, so you can use html_node() instead of html_nodes()
  html_node(xpath='//*[@id="mw-content-text"]/div/table[5]') %>%
  html_table() %>%
  as_tibble() %>%
  setNames(c("country", "corporate_tax", "combined_tax"))

corporatetax %>%
  mutate(corporate_tax = as.numeric(str_replace(corporate_tax, "%", "")) / 100,
         combined_tax  = as.numeric(str_replace(combined_tax, "%", "")) / 100)
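Since the end goal was a bar chart, here is a short follow-up sketch (not from the original answer; entries that don't parse cleanly become NA and are dropped here):
library(ggplot2)

corporatetax_clean <- corporatetax %>%
  mutate(corporate_tax = as.numeric(str_replace(corporate_tax, "%", "")) / 100) %>%
  filter(!is.na(corporate_tax))

# horizontal bar chart of corporate tax rates by country, ordered by rate
ggplot(corporatetax_clean, aes(x = reorder(country, corporate_tax), y = corporate_tax)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Corporate tax rate")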

Extracting multiple pieces of text from multiple web pages

The first part of this code (up to "pages") successfully retrieves the pages from which I want to scrape. I'm then struggling to find a way to extract pieces of article text, with the associated dates, as a data frame.
I get:
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "c('xml_document', 'xml_node')"
Any guidance on elegance, clarity and efficiency also welcome as this is personal learning.
library(rvest)
library(tidyverse)
library(plyr)
library(stringr)
llply(1:2, function(i) {
  read_html(str_c("http://www.thetimes.co.uk/search?p=", i, "&q=tech")) %>%
    html_nodes(".Headline--regular a") %>%
    html_attr("href") %>%
    url_absolute("http://www.thetimes.co.uk")
}) -> links

pages <- links %>% unlist() %>% map(read_html)

map_df(pages, function(x) {
  text = read_html(x) %>%
    html_nodes(".Article-content p") %>%
    html_text() %>%
    str_extract(".+skills.+")
  date = read_html(x) %>%
    html_nodes(".Dateline") %>%
    html_text()
}) -> article_df
Nice, you were nearly there! There are two mistakes here:
The variable pages already contains the parsed HTML code, so applying read_html again on a single page (i.e. inside map_df) doesn't work; that is the error message you are getting.
The function inside map_df isn't correct. As there is no explicit return, the last computed value is returned, which is date; the variable text is discarded entirely. You have to pack these two variables inside a data frame.
The following contains the fixed code.
article_df <- map_df(pages, function(x) {
  data_frame(
    text = x %>%
      html_nodes(".Article-content p") %>%
      html_text() %>%
      str_extract(".+skills.+"),
    date = x %>%
      html_nodes(".Dateline") %>%
      html_text()
  )
})
Also a few comments on the code itself:
I think it is better to use <- instead of ->. This way one can more easily see where a variable is assigned, and if one uses descriptive ("speaking") variable names the code is much easier to understand.
I'd prefer the package purrr over plyr. purrr is part of the tidyverse, so instead of the function llply you can simply use map. There is a nice article on purrr vs. plyr.
links <- map(1:2, function(i) {
  read_html(str_c("http://www.thetimes.co.uk/search?p=", i, "&q=tech")) %>%
    html_nodes(".Headline--regular a") %>%
    html_attr("href") %>%
    url_absolute("http://www.thetimes.co.uk")
})
