Use R to mimic "clicking" on a file to download it

I would like R to automatically download an Excel file about oil and gas rigs from this page. The file is downloaded by clicking on "North America Rotary Rig Count Pivot Table (Feb 2011 - Current)" (the second option), but I cannot seem to find a way to do this in R.
Any clues? Thanks!
Note: Unfortunately, using download.file() does not seem to work. I get a message when trying to open the file in MS Excel that the extension is incorrect or the file is corrupt. I also get this error in R when using readxl::read_excel(): Error: Evaluation error: error -103 with zipfile in unzGetCurrentFileInfo
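(A common cause of exactly this symptom is downloading a binary file in text mode: on Windows, download.file() defaults to mode = "w", which corrupts .xls/.xlsx/.xlsb files. A minimal sketch worth trying first, where file_url is a hypothetical placeholder for the spreadsheet's direct link:)
# mode = "wb" forces a binary download, which often fixes "corrupt file"
# errors when opening downloaded Excel files.
file_url <- "https://rigcount.bakerhughes.com/static-files/..." # hypothetical placeholder
download.file(file_url, destfile = "rig_count.xlsb", mode = "wb")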

Some libs to help
Of everything loaded below, you will actually need only dplyr, purrr, stringr, rvest, and xml2 (plus readxlsb at the very end).
library(tidyverse)
library(rvest)
#> Loading required package: xml2
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#>
#> pluck
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
library(htmltab)
library(xml2)
library(readxl)
I like to do it this way because some sites use partial links.
base <- "https://rigcount.bakerhughes.com"
url <- paste0(base, "/na-rig-count")
# find links
url_html <- xml2::read_html(url)
url_html %>%
  html_nodes("a") %>%
  html_attrs() %>%
  bind_rows() -> url_tbl
Check the href contents and look for a pattern you are interested in.
Your browser's Inspect tool is truly helpful here, too.
url_tbl %>%
count(href)
#> # A tibble: 22 x 2
#> href n
#> <chr> <int>
#> 1 / 1
#> 2 /email-alerts 1
#> 3 /intl-rig-count 1
#> 4 /na-rig-count 1
#> 5 /rig-count-faqs 1
#> 6 /rig-count-overview 2
#> 7 #main-menu 1
#> 8 https://itunes.apple.com/app/baker-hughes-rig-counts/id393570114?mt=8 1
#> 9 https://rigcount.bakerhughes.com/static-files/4ab04723-b638-4310-afd9-… 1
#> 10 https://rigcount.bakerhughes.com/static-files/4b92b553-a48d-43a3-b4d9-… 1
#> # … with 12 more rows
At first I noticed that static-files might be a good pattern to match in href, but then I found a better one in the type attribute.
url_tbl %>%
filter(str_detect(type, "ms-excel")) -> url_xlsx
Build our named vector of files (remember to clean out noise such as extra dots, spaces, and special characters).
I hope someone proposes a better way to handle those things; one candidate is sketched after the next block.
myFiles <- pull(url_xlsx, "href")
names <- pull(url_xlsx, "title")
names(myFiles) <- paste0(
  str_replace_all(names, "[\\.\\-\\ ]", "_"),
  str_extract(names, ".\\w+$")
)
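(One candidate for that "better way", assuming an extra dependency is acceptable, is janitor::make_clean_names(); a sketch:)
library(janitor)
names(myFiles) <- paste0(
  make_clean_names(names),       # snake_case, punctuation stripped
  str_extract(names, "\\.\\w+$") # re-attach the file extension
)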
# download data
myFiles %>%
  imap(
    ~ download.file(
      url = .x,
      destfile = .y,
      method = "curl", # may not be necessary
      extra = "-k"     # disables certificate verification; drop if not needed
    )
  )
#> $`north_america_rotary_rig_count_jan_2000_-_current.xlsb`
#> [1] 0
#>
#> $`north_american_rotary_rig_count_pivot_table_feb_2011_-_current.xlsb`
#> [1] 0
#>
#> $`U.S. Monthly Averages by State 1992-2016.xls`
#> [1] 0
#>
#> $`North America Rotary Rig Counts through 2016.xls`
#> [1] 0
#>
#> $`U.S. Annual Averages by State 1987-2016.xls`
#> [1] 0
#>
#> $Workover_9.xls
#> [1] 0
Created on 2020-12-16 by the reprex package (v0.3.0)
Now you can read in your files.
names(myFiles) %>%
  map(
    readxlsb::read_xlsb
  ) -> myData
I hope it helps.

Related

Read CSV files in the same order as saved in the path, in R

I have multiple CSV files that are stored in a specific order, and I want to read them in this exact same order, from the bottom to the top. They are stored like this:
tFile20.RAW
tFile17.RAW
tFile16.RAW
tFile12.RAW
tFile11.RAW
tFile10.RAW
.
.
and so on until tFile1.RAW. I've seen multiple questions about this issue, but all of them were regarding Python.
I'm using this code, but it is reading the files in a random order (the CSVs are stored in smallfolder):
temp = list.files(path = '/bigfolder/myname/smallfolder', pattern="RAW", full.names = TRUE)
final_list = lapply(temp, read.csv)
It's reading tFile1.RAW and then jumps to tFile10.RAW, tFile11.RAW, and so on.
How can I make it read starting from tFile1.RAW and proceed in numeric order? That is, the first CSV file it reads would be tFile1.RAW, hence final_list[[1]] = tFile1.RAW, then final_list[[2]] = tFile2.RAW, final_list[[3]] = tFile3.RAW, and so on.
library(stringr)
Preparing the folder structure and writing files
# Creating folder
folder_path <- "bigfolder/myname/smallfolder"
dir.create(folder_path, recursive = TRUE)
# Files
files <- c("file1.csv", "file10.csv", "file11.csv", "file12.csv", "file13.csv",
"file14.csv", "file15.csv", "file16.csv", "file17.csv", "file18.csv",
"file19.csv", "file2.csv", "file20.csv", "file3.csv", "file4.csv",
"file5.csv", "file6.csv", "file7.csv", "file8.csv", "file9.csv"
)
# writing files
lapply(files, \(x) write.csv(x, file.path(folder_path, x)))
With that I have a folder structure like the one you described. Now I will
list all the files I am going to read. The only difference is that I use
full.names = FALSE, because on your local machine the path itself may contain digits, which would confuse the number extraction below.
temp <- list.files(folder_path)
You have to sort the files after you use the list.files() function. I would do it as follows:
Extract the integer in the name of the file
file_number <- stringr::str_extract(temp, "[0-9]+") |> as.numeric()
Get the position where each file should go by comparing the sorted file_number with
the positions the files actually have:
correct_index_order <- sapply(sort(file_number), \(x) which(file_number == x))
Rearrange your temp vector with that new index vector:
temp <- temp[correct_index_order]
temp
#> [1] "file1.csv" "file2.csv" "file3.csv" "file4.csv" "file5.csv"
#> [6] "file6.csv" "file7.csv" "file8.csv" "file9.csv" "file10.csv"
#> [11] "file11.csv" "file12.csv" "file13.csv" "file14.csv" "file15.csv"
#> [16] "file16.csv" "file17.csv" "file18.csv" "file19.csv" "file20.csv"
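(As an aside, base R's order() collapses the extract-and-match step into a single call; a sketch equivalent to the sapply() approach above, assuming each file name contains exactly one run of digits:)
# order() returns the permutation that sorts file_number ascending,
# which is the same index vector built with sapply() above.
temp <- temp[order(as.numeric(stringr::str_extract(temp, "[0-9]+")))]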
Now we can read the files
lapply(file.path(folder_path, temp), read.csv)
#> [[1]]
#> X x
#> 1 1 file1.csv
#>
#> [[2]]
#> X x
#> 1 1 file2.csv
#>
#> [[3]]
#> X x
#> 1 1 file3.csv
#>
#> [[4]]
#> X x
#> 1 1 file4.csv
#>
#> [[5]]
#> X x
#> 1 1 file5.csv
#>
#> [[6]]
#> X x
#> 1 1 file6.csv
#>
Created on 2023-01-14 with reprex v2.0.2

Scraping movie scripts failing on small subset

I'm working on scraping the Lord of the Rings movie scripts from this website here. Each script is broken up across multiple pages.
I can get the info I need for a single page with this code:
library(dplyr)
library(rvest)
url_success <- "http://www.ageofthering.com/atthemovies/scripts/fellowshipofthering1to4.php"
success <- read_html(url_success) %>%
  html_elements("#AutoNumber1") %>%
  html_table()
summary(success)
Length Class Mode
[1,] 2 tbl_df list
This works for all Fellowship of the Ring pages and all Return of the King pages. It also works for the Two Towers pages covering scenes 57 to 66. However, any other Two Towers page (scenes 1-56) does not return the same result:
url_fail <- "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"
fail <- read_html(url_fail) %>%
  html_elements("#AutoNumber1") %>%
  html_table()
summary(fail)
Length Class Mode
0 list list
I've inspected the pages in Chrome, and the failing pages appear to have the same structure as the succeeding ones, including the 'AutoNumber1' table. Can anyone help with this?
It works with XPath. Perhaps the HTML is ill-formed (the page doesn't seem very spec-compliant), which can trip up the CSS selector.
library(rvest)
url_fail <- "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"
fail <- read_html(url_fail) %>%
  html_elements(xpath = '//*[@id="AutoNumber1"]') %>%
  html_table()
fail
#> [[1]]
#> # A tibble: 139 × 2
#> X1 X2
#> <chr> <chr>
#> 1 "Scene 1 ~ The Foundations of Stone\r\n\r\n\r\nThe movie opens as the … "Sce…
#> 2 "GANDALF VOICE OVER:" "You…
#> 3 "FRODO VOICE OVER:" "Gan…
#> 4 "GANDALF VOICE OVER:" "I a…
#> 5 "The scene changes to \r\n inside Moria.  Gandalf is on the Bridge … "The…
#> 6 "GANDALF:" "You…
#> 7 "Gandalf slams down his staff onto the Bridge, \r\ncausing it to crack… "Gan…
#> 8 "BOROMIR :" "(ho…
#> 9 "FRODO:" "Gan…
#> 10 "GANDALF:" "Fly…
#> # … with 129 more rows
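(Since html_table() returns a list of tibbles, the script table itself is the first element:)
script <- fail[[1]] # the one matched table, as a tibble
head(script$X1)     # speaker / stage-direction column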

Create dynamic R dataframe names in for loop - multiple names in same code line

I am trying to create dynamic dataframe names within a for loop, using the paste function in R to build the dataframe names. See the example below:
for (i in 1:3){
  paste("Data",i,sep="") <- data.frame(colone=c(1,2,3,4),coltwo=c(5,6,7,8))
  paste("New data",i,sep="") <- paste("Data",i,sep="") %>% mutate(colthree=(colone+coltwo)*i) %>% select(colthree)
}
The code above won't work because R doesn't accept paste() as an assignment target. I have found some solutions using the assign() function, which helps with the first line: assign(paste("Data",i,sep=""), data.frame(colone=c(1,2,3,4),coltwo=c(5,6,7,8))). But I don't know what to do with the second line, where paste() is used twice to refer to multiple data frames. I'm not sure a nested assign() works, and even if it does, the code will look terrible once the logic gets more complex.
I know there might be ideas of how to combine the 2 lines above into a single assign statement or other similar solutions but is there any way to refer to 2 dynamic dataframe names within a single line of code as per my example above?
Many thanks :)
If you need both data frames ("Data i" and "New Data i") you can use:
for (i in 1:3){
  assign(paste("New data",i,sep=""),
         data.frame(assign(paste("Data",i,sep=""),
                           data.frame(colone=c(1,2,3,4),coltwo=c(5,6,7,8))) %>%
                      mutate(colthree=(colone+coltwo)*i) %>%
                      select(colthree)))
}
If you only want "New Data i" use:
for (i in 1:3){
  assign(paste("New data",i,sep=""),
         data.frame(colone=c(1,2,3,4),coltwo=c(5,6,7,8)) %>%
           mutate(colthree=(colone+coltwo)*i) %>%
           select(colthree))
}
This seems to be working but it's a little bit convoluted:
library(tidyverse)
library(stringr)
library(rlang)
#>
#> Attaching package: 'rlang'
#> The following objects are masked from 'package:purrr':
#>
#> %@%, as_function, flatten, flatten_chr, flatten_dbl, flatten_int,
#> flatten_lgl, flatten_raw, invoke, list_along, modify, prepend,
#> splice
i <- seq(1, 3, 1) #how many loops
pwalk(list(paste("Data",i,sep=""), paste("New_data",i,sep=""), i), ~{
  assign(..1, data.frame(colone=c(1,2,3,4),coltwo=c(5,6,7,8)), envir = .GlobalEnv)
  new <- sym(..1) # convert a string to a variable name
  assign(..2, {eval_tidy(new) %>% mutate(colthree=(colone+coltwo)*..3) %>% select(colthree)}, envir = .GlobalEnv)
})
names(.GlobalEnv)
#> [1] "Data1" "Data2" "Data3" "i" "New_data1" "New_data2"
#> [7] "New_data3"
Data1
#> colone coltwo
#> 1 1 5
#> 2 2 6
#> 3 3 7
#> 4 4 8
New_data1
#> colthree
#> 1 6
#> 2 8
#> 3 10
#> 4 12
Created on 2021-06-10 by the reprex package (v2.0.0)
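(A side note: when dynamic names are not a hard requirement, keeping the data frames in named lists sidesteps assign() entirely. A minimal sketch of the same computation:)
library(dplyr)
library(purrr)
# Each list element plays the role of one "Data i" / "New data i".
data_list <- map(1:3, ~ data.frame(colone = c(1, 2, 3, 4),
                                   coltwo = c(5, 6, 7, 8)))
new_data_list <- imap(data_list, ~ .x %>%
                        mutate(colthree = (colone + coltwo) * .y) %>%
                        select(colthree))
names(data_list) <- paste0("Data", 1:3)
names(new_data_list) <- paste0("New_data", 1:3)
new_data_list$New_data2 # accessed by name instead of via get()/assign()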

Extracting a list of links from a webpage by using its class

I am trying to extract from this website a list of four links that are clearly named as:
PNADC_012018_20190729.zip
PNADC_022018_20190729.zip
PNADC_032018_20190729.zip
PNADC_042018_20190729.zip
I've seen that they are all part of a class called 'jstree-wholerow'. I'm not very good at scraping, but I've tried to capture the links by exploiting that regularity:
x <- rvest::read_html('https://www.ibge.gov.br/estatisticas/downloads-estatisticas.html?caminho=Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018') %>%
  rvest::html_nodes("jstree-wholerow") %>%
  rvest::html_text()
However, I received an empty vector as output.
Can someone help fixing this?
Although the webpage uses JavaScript, the files are stored on an FTP server, which also has very predictable directory names.
library(tidyverse)
library(stringr)
library(rvest)
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
library(RCurl)
#>
#> Attaching package: 'RCurl'
#> The following object is masked from 'package:tidyr':
#>
#> complete
link <- 'https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_042018_20190729.zip'
zip_names <- c('PNADC_012018_20190729.zip', 'PNADC_022018_20190729.zip', 'PNADC_032018_20190729.zip', 'PNADC_042018_20190729.zip')
links <- str_replace(link, '/2018.*\\.zip$', str_c('/2018/', zip_names))
links
#> [1] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_012018_20190729.zip"
#> [2] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_022018_20190729.zip"
#> [3] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_032018_20190729.zip"
#> [4] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_042018_20190729.zip"
#option 2
links <- RCurl::getURL(url = 'https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/') %>%
  read_html() %>%
  html_nodes(xpath = '//td/a[@href]') %>%
  html_attr('href')
links <- links[-1] # drop the parent-directory link
links
#> [1] "PNADC_012018_20190729.zip" "PNADC_022018_20190729.zip"
#> [3] "PNADC_032018_20190729.zip" "PNADC_042018_20190729.zip"
str_c('https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/', links)
#> [1] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_012018_20190729.zip"
#> [2] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_022018_20190729.zip"
#> [3] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_032018_20190729.zip"
#> [4] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_042018_20190729.zip"
Created on 2021-06-11 by the reprex package (v2.0.0)
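(To actually fetch the zips once the links are assembled, a minimal sketch; mode = "wb" matters because the zips are binary files:)
full_links <- str_c('https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/', links)
# download each zip next to the script, named after the file itself
walk2(full_links, basename(full_links),
      ~ download.file(.x, destfile = .y, mode = "wb"))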

Web scraping with R and selector gadget

I am trying to scrape data from a website using R. I am using rvest in an attempt to mimic an example scraping the IMDB page for the Lego Movie. The example advocates use of a tool called Selector Gadget to help easily identify the html_node associated with the data you are seeking to pull.
I am ultimately interested in building a data frame that has the following schema/columns:
rank, blog_name, facebook_fans, twitter_followers, alexa_rank.
I was able to use Selector Gadget to correctly identify the html tag used in the Lego example. However, following the same process and the same code structure as the Lego example, I get NAs (a 'NAs introduced by coercion' warning and an output of [1] NA). My code is below:
data2_html = read_html("http://blog.feedspot.com/video_game_news/")
data2_html %>%
  html_node(".stats") %>%
  html_text() %>%
  as.numeric()
I have also experimented with html_node(".stats , .stats span"), which seems to work for the "Facebook fans" column since it reports 714 matches; however, only 1 number is returned:
714 matches for .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')] | .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')]/descendant-or-self::*/span: using first{xml_node}
<td>
[1] <span>997,669</span>
This may help you:
library(rvest)
d1 <- read_html("http://blog.feedspot.com/video_game_news/")
stats <- d1 %>%
  html_nodes(".stats") %>%
  html_text()
blogname <- d1 %>%
  html_nodes(".tlink") %>%
  html_text()
Note that it is html_nodes (plural).
Result:
> head(blogname)
[1] "Kotaku - The Gamer's Guide" "IGN | Video Games" "Xbox Wire" "Official PlayStation Blog"
[5] "Nintendo Life " "Game Informer"
> head(stats,12)
[1] "997,669" "1,209,029" "873" "4,070,476" "4,493,805" "399" "23,141,452" "10,210,993" "879"
[10] "38,019,811" "12,059,607" "500"
blogname returns the list of blog names, which is easy to manage. The stats info, on the other hand, comes out mixed, because the stats class is the same for the Facebook, Twitter, and Alexa numbers, making them indistinguishable from one another. The output array therefore carries the information in groups of three, that is stats = c(fb, tw, alx, fb, tw, alx, ...), and you should pull each series out of it.
FBstats = stats[seq(1,length(stats),3)]
> head(stats[seq(1,length(stats),3)])
[1] "997,669" "4,070,476" "23,141,452" "38,019,811" "35,977" "603,681"
You can use html_table to extract the whole table with minimal work:
library(rvest)
library(tidyverse)
# scrape html
h <- 'http://blog.feedspot.com/video_game_news/' %>% read_html()
game_blogs <- h %>%
  html_node('table') %>%    # select enclosing table node
  html_table() %>%          # turn table into data.frame
  set_names(make.names) %>% # make names syntactic
  mutate(Blog.Name = sub('\\s?\\+.*', '', Blog.Name)) %>% # extract title from name info
  mutate_at(3:5, parse_number) %>% # make numbers actually numbers
  tbl_df() # for printing
game_blogs
#> # A tibble: 119 x 5
#> Rank Blog.Name Facebook.Fans Twitter.Followers Alexa.Rank
#> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 Kotaku - The Gamer's Guide 997669 1209029 873
#> 2 2 IGN | Video Games 4070476 4493805 399
#> 3 3 Xbox Wire 23141452 10210993 879
#> 4 4 Official PlayStation Blog 38019811 12059607 500
#> 5 5 Nintendo Life 35977 95044 17727
#> 6 6 Game Informer 603681 1770812 10057
#> 7 7 Reddit | Gamers 1003705 430017 25
#> 8 8 Polygon 623808 485827 1594
#> 9 9 Xbox Live's Major Nelson 65905 993481 23114
#> 10 10 VG247 397798 202084 3960
#> # ... with 109 more rows
It's worth checking that everything is parsed like you want, but it should be usable at this point.
This uses html_nodes (plural) and str_replace_all to remove the commas in the numbers. Not sure if these are all the stats you need.
library(rvest)
library(stringr)
data2_html = read_html("http://blog.feedspot.com/video_game_news/")
data2_html %>%
  html_nodes(".stats") %>%
  html_text() %>%
  str_replace_all(',', '') %>%
  as.numeric()
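(If the goal is the data frame from the question, the cleaned vector can be reshaped three values at a time; a sketch that assumes the fb/tw/alexa interleaving noted in the first answer and a length divisible by three:)
stats_num <- data2_html %>%
  html_nodes(".stats") %>%
  html_text() %>%
  str_replace_all(',', '') %>%
  as.numeric()
# byrow = TRUE keeps each blog's three stats together on one row
stats_df <- as.data.frame(matrix(stats_num, ncol = 3, byrow = TRUE))
names(stats_df) <- c("facebook_fans", "twitter_followers", "alexa_rank")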
