I am trying to sort a list of files in a directory. I used different libraries, but all gave me the same result. Example:
myFiles <- paste0("Archivo_", c(1:2),"S",rep(c(2010:2015), each=2), ".txt")
# install.packages("gtools")
library(gtools)
mixedsort(myFiles)
sort(myFiles)
# install.packages("naturalsort")
library("naturalsort")
naturalsort(myFiles)
[1] "Archivo_1S2010.txt" "Archivo_1S2011.txt" "Archivo_1S2012.txt"
[4] "Archivo_1S2013.txt" "Archivo_1S2014.txt" "Archivo_1S2015.txt"
[7] "Archivo_2S2010.txt" "Archivo_2S2011.txt" "Archivo_2S2012.txt"
[10] "Archivo_2S2013.txt" "Archivo_2S2014.txt" "Archivo_2S2015.txt"
I would like to get:
myFiles
"Archivo_1S2010.txt" "Archivo_2S2010.txt" "Archivo_1S2011.txt"
"Archivo_2S2011.txt" "Archivo_1S2012.txt" "Archivo_2S2012.txt"
"Archivo_1S2013.txt" "Archivo_2S2013.txt" "Archivo_1S2014.txt"
"Archivo_2S2014.txt" "Archivo_1S2015.txt" "Archivo_2S2015.txt"
library(dplyr)
myFiles %>%
  tibble(archivo = .) %>%
  mutate(archivo_ref = gsub("_\\dS", "", archivo)) %>%
  arrange(archivo_ref) %>%
  select(archivo) %>%
  unlist() %>%
  unname()
[1] "Archivo_1S2010.txt" "Archivo_2S2010.txt" "Archivo_1S2011.txt" "Archivo_2S2011.txt"
[5] "Archivo_1S2012.txt" "Archivo_2S2012.txt" "Archivo_1S2013.txt" "Archivo_2S2013.txt"
[9] "Archivo_1S2014.txt" "Archivo_2S2014.txt" "Archivo_1S2015.txt" "Archivo_2S2015.txt"
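A base R alternative needs no extra packages: strip the half-year tag to build a sort key, then index with order(). Because order() is a stable sort, files from the same year keep their original 1S-before-2S order. A minimal sketch:

```r
# Build the same file names as in the question
myFiles <- paste0("Archivo_", 1:2, "S", rep(2010:2015, each = 2), ".txt")

# Sort key: the name with the "_<n>S" half-year tag removed, i.e. year only;
# ties (same year) keep their original relative order because order() is stable
sorted <- myFiles[order(gsub("_\\dS", "", myFiles))]
```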
I am using rvest to scrape an IMDB list and want to access the list of full cast and crew. Unfortunately, IMDB has created a summary page when you click on the title and it takes me to the wrong page.
This is the webpage I get: https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt
This is the webpage I need: https://www.imdb.com/title/tt1375666/fullcredits/?ref_=tt_ql_cl
Notice the addition of the /fullcredits in the URL.
How can I insert /fullcredits into the middle of a URL I have built?
#install.packages("rvest")
#install.packages("dplyr")
library(rvest) #webscraping package
library(dplyr) #piping
link <- "https://www.imdb.com/list/ls006266261/?st_dt=&mode=detail&page=1&sort=list_order,asc"
credits <- "fullcredits/"
page <- read_html(link)
name <- page %>% rvest::html_nodes(".lister-item-header a") %>% rvest::html_text()
movie_link <- page %>% rvest::html_nodes(".lister-item-header a") %>% html_attr("href") %>% paste0("https://www.imdb.com", .)
Here is an option: get the dirname and basename from the link, replace the substring of the basename with the new substring ("tt_ql_cl"), and join them again with file.path after inserting "fullcredits" in between.
library(stringr)
movie_link2 <- file.path(dirname(movie_link), "fullcredits",
                         str_replace(basename(movie_link), "ttls_li_tt", "tt_ql_cl"))
Output:
> head(movie_link2)
[1] "https://www.imdb.com/title/tt0068646/fullcredits/?ref_=tt_ql_cl"
[2] "https://www.imdb.com/title/tt0099685/fullcredits/?ref_=tt_ql_cl"
[3] "https://www.imdb.com/title/tt0110912/fullcredits/?ref_=tt_ql_cl"
[4] "https://www.imdb.com/title/tt0114814/fullcredits/?ref_=tt_ql_cl"
[5] "https://www.imdb.com/title/tt0078788/fullcredits/?ref_=tt_ql_cl"
[6] "https://www.imdb.com/title/tt0117951/fullcredits/?ref_=tt_ql_cl"
> tail(movie_link2)
[1] "https://www.imdb.com/title/tt0144084/fullcredits/?ref_=tt_ql_cl"
[2] "https://www.imdb.com/title/tt0119654/fullcredits/?ref_=tt_ql_cl"
[3] "https://www.imdb.com/title/tt0477348/fullcredits/?ref_=tt_ql_cl"
[4] "https://www.imdb.com/title/tt0080339/fullcredits/?ref_=tt_ql_cl"
[5] "https://www.imdb.com/title/tt0469494/fullcredits/?ref_=tt_ql_cl"
[6] "https://www.imdb.com/title/tt1375666/fullcredits/?ref_=tt_ql_cl"
Another way:
df1 <- gsub("\\?.*", "", movie_link)
df <- paste0(df1, "fullcredits/?ref_=tt_ql_cl")
df
[1] "https://www.imdb.com/title/tt0068646/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0099685/fullcredits/?ref_=tt_ql_cl"
[3] "https://www.imdb.com/title/tt0110912/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0114814/fullcredits/?ref_=tt_ql_cl"
[5] "https://www.imdb.com/title/tt0078788/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0117951/fullcredits/?ref_=tt_ql_cl"
[7] "https://www.imdb.com/title/tt0137523/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0108052/fullcredits/?ref_=tt_ql_cl"
[9] "https://www.imdb.com/title/tt0118749/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0105236/fullcredits/?ref_=tt_ql_cl"
[11] "https://www.imdb.com/title/tt0111161/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0073195/fullcredits/?ref_=tt_ql_cl"
[13] "https://www.imdb.com/title/tt0075314/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0119488/fullcredits/?ref_=tt_ql_cl"
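If every list link carries the same referral tag, the insertion can also be done with a single sub() per link. A minimal sketch, assuming the "?ref_=ttls_li_tt" suffix from the question:

```r
# Splice "fullcredits/" in before the query string and swap the referral tag;
# fixed = TRUE treats the pattern as a literal string, not a regex
movie_link <- "https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt"
full <- sub("?ref_=ttls_li_tt", "fullcredits/?ref_=tt_ql_cl", movie_link,
            fixed = TRUE)
```

sub() is vectorised over its third argument, so the same call works on the whole movie_link vector.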
I would like to filter observations for a given year-month from a data frame, save the result as a separate data frame, and name it with the respective year-month.
I would be grateful if someone could suggest more efficient code than the one below. Also, this code is not filtering the observations correctly.
library(dplyr)

data <- data.frame(year = c(rep(2012, 12), rep(2013, 12), rep(2014, 12),
                            rep(2015, 12), rep(2016, 12)),
                   month = rep(1:12, 5),
                   info = seq(60) * 100)

years <- 2012:2016
months <- 1:12

for (year in years) {
  for (month in months) {
    data_sel <- data %>%
      filter(year == year & month == month)
    if (month < 10) {
      month_alt <- paste0("0", month)  # months 1-9 should show up as 01-09
    }
    Newname <- paste0(year, month_alt, "_", "data_sel")
    assign(Newname, data_sel)
  }
}
The output I am looking to get is below (separate objects containing data from a given year-month):
> ls()
[1] "201201_data_sel" "201202_data_sel" "201203_data_sel" "201204_data_sel"
[5] "201205_data_sel" "201206_data_sel" "201207_data_sel" "201208_data_sel"
[9] "201209_data_sel" "201301_data_sel" "201302_data_sel" "201303_data_sel"
[13] "201304_data_sel" "201305_data_sel" "201306_data_sel" "201307_data_sel"
[17] "201308_data_sel" "201309_data_sel" "201401_data_sel" "201402_data_sel"
[21] "201403_data_sel" "201404_data_sel" "201405_data_sel" "201406_data_sel"
[25] "201407_data_sel" "201408_data_sel" "201409_data_sel" "201501_data_sel"
[29] "201502_data_sel" "201503_data_sel" "201504_data_sel" "201505_data_sel"
[33] "201506_data_sel" "201507_data_sel" "201508_data_sel" "201509_data_sel"
[37] "201601_data_sel" "201602_data_sel" "201603_data_sel" "201604_data_sel"
[41] "201605_data_sel" "201606_data_sel" "201607_data_sel" "201608_data_sel"
[45] "201609_data_sel" "data" "data_sel" "month"
[49] "month_alt" "months" "Newname" "year"
[53] "years"
You could do:
library(dplyr)
g <- data %>%
  mutate(month = sprintf("%02d", month)) %>%
  group_by(year, month)

setNames(group_split(g), with(group_keys(g), paste0("data_sel_", year, month))) %>%
  list2env(envir = .GlobalEnv)
Starting an object name with a digit is not allowed in R, which is why "data_sel_" comes first inside paste0.
As mentioned in the comments, it might be better not to pipe into list2env at all and instead keep the output as a list with named elements.
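The list-keeping approach can be sketched with base R's split() alone, using the year-month key as the element name (the data frame is rebuilt here so the chunk is self-contained):

```r
# One named list instead of 60 objects in the global environment
data <- data.frame(year = rep(2012:2016, each = 12),
                   month = rep(1:12, 5),
                   info = seq(60) * 100)

# sprintf("%02d", ...) zero-pads the month, fixing the 01-09 naming issue too
keys <- sprintf("%d%02d_data_sel", data$year, data$month)
data_list <- split(data, keys)
```

Individual chunks are then reached as data_list[["201201_data_sel"]], and lapply() replaces any per-object loop.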
I'm looking to scrape the third table off of this website and store it as a data frame. Below is a reproducible example
The third table is the one with "Isiah YOUNG" in the first row, third column.
library(rvest)
library(dplyr)
target_url <-
"https://flashresults.com/2017_Meets/Outdoor/06-22_USATF/004-2-02.htm"
table <- target_url %>%
  read_html(options = c("DTDLOAD")) %>%
  html_nodes("[id^=splitevents]") # this is the correct node
So far so good. Printing table[[1]] shows the contents I want.
table[[1]]
{html_node}
<table id="splitevents" class="sortable" align="center">
[1] <tr>\n<th class="sorttable_nosort" width="20">Pl</th>\n<th class="sorttable_nosort" width="20">Ln</th>\n<th ...
[2] <td>1</td>\n
[3] <td>6</td>\n
[4] <td></td>\n
[5] <td>Isiah YOUNG</td>\n
[6] <td></td>\n
[7] <td>NIKE</td>\n
[8] <td>20.28 Q</td>\n
[9] <td><b><font color="grey">0.184</font></b></td>
[10] <td>2</td>\n
[11] <td>7</td>\n
[12] <td></td>\n
[13] <td>Elijah HALL-THOMPSON</td>\n
[14] <td></td>\n
[15] <td>Houston</td>\n
[16] <td>20.50 Q</td>\n
[17] <td><b><font color="grey">0.200</font></b></td>
[18] <td>3</td>\n
[19] <td>9</td>\n
[20] <td></td>\n
...
However, passing this to html_table results in an empty data frame.
table[[1]] %>%
html_table(fill = TRUE)
[1] Pl Ln Athlete Affiliation Time
<0 rows> (or 0-length row.names)
How can I get the contents of table[[1]] (which clearly do exist) as a data frame?
The HTML is full of errors that trip up the parser, and I haven't seen an easy way to fix them.
An alternative in this particular scenario is to use the header count to determine the column count, then derive the row count by dividing the total td count by the number of columns; use these to convert the cell text into a matrix and then a data frame.
library(rvest)
library(dplyr)
target_url <- "https://flashresults.com/2017_Meets/Outdoor/06-22_USATF/004-2-02.htm"
table <- read_html(target_url) %>%
  html_node("#splitevents")

tds <- table %>% html_nodes("td") %>% html_text()
ths <- table %>% html_nodes("th") %>% html_text()

num_col <- length(ths)
num_row <- length(tds) / num_col

df <- tds %>%
  matrix(nrow = num_row, ncol = num_col, byrow = TRUE) %>%
  data.frame() %>%
  setNames(ths)
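The shape trick is easier to see on a tiny hard-coded example, with no network involved (the cell texts below are fabricated from the question's output):

```r
# td texts arrive in row-major order, so the header count fixes the matrix shape
ths <- c("Pl", "Ln", "Athlete")
tds <- c("1", "6", "Isiah YOUNG",
         "2", "7", "Elijah HALL-THOMPSON")

num_col <- length(ths)
num_row <- length(tds) / num_col

df <- setNames(
  data.frame(matrix(tds, nrow = num_row, ncol = num_col, byrow = TRUE),
             stringsAsFactors = FALSE),
  ths)
```

Note this only works when every row has the same number of cells; a row with missing td elements would shift all later cells.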
I am unable to extract image links from a website.
I am new to data scraping. I have used SelectorGadget as well as the inspect-element method to get the class of the image, but to no avail.
library(rvest)

main.page <- read_html("https://www.espncricinfo.com/series/17213/scorecard/64951/england-vs-india-1st-odi-india-tour-of-england-1974")
urls <- main.page %>%
  html_nodes(".match-detail--item:nth-child(9) .lazyloaded") %>%
  html_attr("src")
sotu <- data.frame(urls = urls)
I am getting the following output:
<0 rows> (or 0-length row.names)
Certain classes and parameters don't show up in the scraped data for some reason. Just target img instead of .lazyloaded and data-src instead of src:
library(rvest)
main.page <- read_html("https://www.espncricinfo.com/series/17213/scorecard/64951/england-vs-india-1st-odi-india-tour-of-england-1974")
main.page %>%
  html_nodes(".match-detail--item:nth-child(9) img") %>%
  html_attr("data-src")
#### OUTPUT ####
[1] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/1.png&h=25&w=25"
[2] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[3] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[4] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[5] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[6] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[7] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[8] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[9] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[10] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[11] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
[12] "https://a1.espncdn.com/combiner/i?img=/i/teamlogos/cricket/500/6.png&h=25&w=25"
Because the DOM is modified via JavaScript (React) in the browser, rvest does not see the same layout you do. A less optimal option is to regex out the info from the JavaScript object the links are housed in, then use a JSON parser to extract them:
library(rvest)
library(jsonlite)
library(stringr)
library(magrittr)
url <- "https://www.espncricinfo.com/series/17213/scorecard/64951/england-vs-india-1st-odi-india-tour-of-england-1974"
r <- read_html(url) %>%
  html_nodes("body") %>%
  html_text() %>%
  toString()
x <- str_match_all(r,'debuts":(.*?\\])')
json <- jsonlite::fromJSON(x[[1]][,2])
print(json$imgicon)
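The regex step itself can be sketched without the network, on a fabricated snippet; only the 'debuts":[...]' shape is assumed here, and base R's regexec() stands in for stringr:

```r
# Fabricated stand-in for the page's embedded JavaScript object
r <- 'var data = {"debuts":[{"imgicon":"https://a1.espncdn.com/combiner/i.png"}]};'

# Capture the JSON array that follows "debuts": (lazy .*? stops at the first ])
m <- regmatches(r, regexec('debuts":(.*?\\])', r))[[1]][2]
# m now holds the raw JSON array, ready for jsonlite::fromJSON(m)
```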
I'm trying to use R to fetch all the links to data files on the Eurostat website. While my code currently "works", I seem to get a duplicate result for every link.
Note, the use of download.file is to get around my company's firewall, per this answer
library(dplyr)
library(rvest)
myurl <- "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?dir=data&sort=1&sort=2&start=all"
download.file(myurl, destfile = "eurofull.html")
content <- read_html("eurofull.html")
links <- content %>%
  html_nodes("a") %>% # note: I don't know the significance of "a"; this was trial and error
  html_attr("href") %>%
  data.frame()
# filter to only get the ".tsv.gz" links
files <- filter(links, grepl("tsv.gz", .))
Looking at the top of the dataframe
files$.[1:6]
[1] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?
sort=1&file=data%2Faact_ali01.tsv.gz
[2] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?
sort=1&downfile=data%2Faact_ali01.tsv.gz
[3] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?
sort=1&file=data%2Faact_ali02.tsv.gz
[4] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?
sort=1&downfile=data%2Faact_ali02.tsv.gz
[5] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?
sort=1&file=data%2Faact_eaa01.tsv.gz
[6] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?
sort=1&downfile=data%2Faact_eaa01.tsv.gz
The only difference between 1 and 2 is that 1 says "...file=data..." while 2 says "...downfile=data...". This pattern continues for all pairs down the dataframe.
If I download 1 and 2 and read the files into R, an identical check confirms they are the same.
Why are two links to the same data being returned? Is there a way (other than filtering for "downfile") to only return one of the links?
As noted, you can just do some better node targeting. This uses XPath instead of CSS selectors and picks only the links with downfile in the href:
html_nodes(content, xpath = ".//a[contains(@href, 'downfile')]") %>%
  html_attr("href") %>%
  sprintf("http://ec.europa.eu/%s", .) %>%
  head()
## [1] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali01.tsv.gz"
## [2] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali02.tsv.gz"
## [3] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa01.tsv.gz"
## [4] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa02.tsv.gz"
## [5] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa03.tsv.gz"
## [6] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa04.tsv.gz"
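If you would rather keep the original "a"-node scrape, the same result falls out of filtering the href vector afterwards. A base R sketch on fabricated hrefs matching the question's pairs:

```r
# Each file appears twice: once with file=, once with downfile=
hrefs <- c(
  "/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_ali01.tsv.gz",
  "/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali01.tsv.gz",
  "/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_ali02.tsv.gz",
  "/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali02.tsv.gz"
)

# Keep only the downfile variant of each pair
keep <- hrefs[grepl("downfile=", hrefs, fixed = TRUE)]
```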