Information lost by html_table

I'm looking to scrape the third table from this website and store it as a data frame. Below is a reproducible example.
The third table is the one with "Isiah YOUNG" in the first row, third column.
library(rvest)
library(dplyr)
target_url <-
"https://flashresults.com/2017_Meets/Outdoor/06-22_USATF/004-2-02.htm"
table <- target_url %>%
read_html(options = c("DTDLOAD")) %>%
html_nodes("[id^=splitevents]") # this is the correct node
So far so good. Printing table[[1]] shows the contents I want.
table[[1]]
{html_node}
<table id="splitevents" class="sortable" align="center">
[1] <tr>\n<th class="sorttable_nosort" width="20">Pl</th>\n<th class="sorttable_nosort" width="20">Ln</th>\n<th ...
[2] <td>1</td>\n
[3] <td>6</td>\n
[4] <td></td>\n
[5] <td>Isiah YOUNG</td>\n
[6] <td></td>\n
[7] <td>NIKE</td>\n
[8] <td>20.28 Q</td>\n
[9] <td><b><font color="grey">0.184</font></b></td>
[10] <td>2</td>\n
[11] <td>7</td>\n
[12] <td></td>\n
[13] <td>Elijah HALL-THOMPSON</td>\n
[14] <td></td>\n
[15] <td>Houston</td>\n
[16] <td>20.50 Q</td>\n
[17] <td><b><font color="grey">0.200</font></b></td>
[18] <td>3</td>\n
[19] <td>9</td>\n
[20] <td></td>\n
...
However, passing this to html_table results in an empty data frame.
table[[1]] %>%
html_table(fill = TRUE)
[1] Pl Ln Athlete Affiliation Time
<0 rows> (or 0-length row.names)
How can I get the contents of table[[1]] (which clearly do exist) as a data frame?

The HTML is malformed in ways that trip up the parser (in the printed node above, the td cells appear directly under the table rather than inside tr rows), and I haven't found an easy way to repair it.
An alternative, in this particular scenario, is to take the column count from the number of th header cells, derive the row count by dividing the total td count by the number of columns, and use both to reshape the cell text into a matrix and then a data frame.
library(rvest)
library(dplyr)
target_url <- "https://flashresults.com/2017_Meets/Outdoor/06-22_USATF/004-2-02.htm"
table <- read_html(target_url) %>%
html_node("#splitevents")
tds <- table %>% html_nodes('td') %>% html_text()
ths <- table %>% html_nodes("th") %>% html_text()
num_col <- length(ths)
num_row <- length(tds) / num_col
df <- tds %>%
matrix(nrow = num_row, ncol = num_col, byrow = TRUE) %>%
data.frame() %>%
setNames(ths)
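The reshaping step can be tried without the website. The following is a minimal sketch in base R, assuming the header and cell text have already been extracted as character vectors; the function name cells_to_df and the trimmed sample values are illustrative and not part of the original answer. It also adds a divisibility check, since the approach only works when every row has exactly one cell per header.

```r
# Hypothetical helper: reshape a flat vector of cell text into a data frame,
# using the header count as the number of columns.
cells_to_df <- function(headers, cells) {
  n_col <- length(headers)
  # The approach breaks down if the cells do not fill whole rows
  if (n_col == 0 || length(cells) %% n_col != 0) {
    stop("cell count is not a multiple of the header count")
  }
  # byrow = TRUE because the cells arrive in reading order, row after row
  df <- as.data.frame(matrix(cells, ncol = n_col, byrow = TRUE),
                      stringsAsFactors = FALSE)
  names(df) <- headers
  df
}

# Trimmed, illustrative sample (not the full set of columns on the page)
ths <- c("Pl", "Ln", "Athlete", "Time")
tds <- c("1", "6", "Isiah YOUNG", "20.28 Q",
         "2", "7", "Elijah HALL-THOMPSON", "20.50 Q")
df <- cells_to_df(ths, tds)
print(df)
```

If the check fails on the real page, the header count and the per-row cell count differ, and the column count would have to be determined some other way.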

Related

How to download data from the Reptile database using r

I am using R to try to download images from the Reptile Database by filling in its search form to find specific images. For that, I am following previous suggestions for filling in an online form from R, such as:
library(httr)
library(tidyverse)
POST(
url = "http://reptile-database.reptarium.cz/advanced_search",
encode = "json",
body = list(
genus = "Chamaeleo",
species = "dilepis"
)) -> res
out <- content(res)[1]
This seems to work smoothly, but my problem now is to identify the link with the correct species name in the resulting out object.
This object should contain the following page:
https://reptile-database.reptarium.cz/species?genus=Chamaeleo&species=dilepis&search_param=%28%28genus%3D%27Chamaeleo%27%29%28species%3D%27dilepis%27%29%29
This page contains names with links. I would like to identify the link that takes me to the page with the correct species' table. However, I am unable to find the link, or even the name of the species, within the generated out object.

Here I only extract the links to the pictures. To download them, map or apply download.file() over the result.
library(tidyverse)
library(rvest)
genus <- "Chamaeleo"
species <- "dilepis"
pics <- paste0(
"http://reptile-database.reptarium.cz/species?genus=", genus,
"&species=", species) %>%
read_html() %>%
html_elements("#gallery img") %>%
html_attr("src")
[1] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000034021_01_t.jpg"
[2] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000033342_01_t.jpg"
[3] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029987_01_t.jpg"
[4] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029988_01_t.jpg"
[5] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035130_01_t.jpg"
[6] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035131_01_t.jpg"
[7] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035132_01_t.jpg"
[8] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035133_01_t.jpg"
[9] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036237_01_t.jpg"
[10] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036238_01_t.jpg"
[11] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036239_01_t.jpg"
[12] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041048_01_t.jpg"
[13] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041049_01_t.jpg"
[14] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041050_01_t.jpg"
[15] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041051_01_t.jpg"
[16] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042287_01_t.jpg"
[17] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042288_01_t.jpg"
[18] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0070.jpeg"
[19] "https://calphotos.berkeley.edu/imgs/128x192/1338_3161/0662/0074.jpeg"
[20] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0082.jpeg"
[21] "https://calphotos.berkeley.edu/imgs/128x192/1338_3152/3386/0125.jpeg"
[22] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/1009/0136.jpeg"
[23] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/0210/0057.jpeg"
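The download step mentioned in the answer could look roughly like the sketch below. The helper names dest_paths and download_pics and the default folder reptile_pics are assumptions made for illustration; only download.file() comes from the answer itself. Each file is saved under the last path component of its URL, and failures are recorded instead of stopping the loop.

```r
# Hypothetical helper: local file path for each image URL
dest_paths <- function(urls, dir = "reptile_pics") {
  file.path(dir, basename(urls))
}

# Hypothetical helper: download every URL, returning one row per attempt
download_pics <- function(urls, dir = "reptile_pics") {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  dest <- dest_paths(urls, dir)
  ok <- vapply(seq_along(urls), function(i) {
    # mode = "wb" keeps binary image data intact on Windows;
    # download.file() returns 0 on success
    tryCatch(
      download.file(urls[i], dest[i], mode = "wb", quiet = TRUE) == 0L,
      error = function(e) FALSE
    )
  }, logical(1))
  data.frame(url = urls, file = dest, ok = ok, stringsAsFactors = FALSE)
}

# Usage (requires network access): result <- download_pics(pics)
```

Note that two URLs with the same file name would overwrite each other; the sketch does not guard against that.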

Inserting string into the middle of a URL in R

I am using rvest to scrape an IMDB list and want to access the full cast and crew of each title. Unfortunately, clicking a title on IMDB leads to a summary page, so the link I scrape takes me to the wrong page.
This is the webpage I get: https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt
This is the webpage I need: https://www.imdb.com/title/tt1375666/fullcredits/?ref_=tt_ql_cl
Notice the addition of the /fullcredits in the URL.
How can I insert /fullcredits into the middle of a URL I have built?
#install.packages("rvest")
#install.packages("dplyr")
library(rvest) #webscraping package
library(dplyr) #piping
link = "https://www.imdb.com/list/ls006266261/?st_dt=&mode=detail&page=1&sort=list_order,asc"
credits = "fullcredits/"
page = read_html(link)
name <- page %>% rvest::html_nodes(".lister-item-header a") %>% rvest::html_text()
movie_link = page %>% rvest::html_nodes(".lister-item-header a") %>% html_attr("href") %>% paste("https://www.imdb.com", ., sep="")

Here is an option: split each link into its dirname and basename, replace the substring "ttls_li_tt" in the basename with "tt_ql_cl", and join the pieces again with file.path, inserting "fullcredits" in between.
library(stringr)
movie_link2 <- file.path(dirname(movie_link), "fullcredits",
str_replace(basename(movie_link), "ttls_li_tt", "tt_ql_cl"))
-output
> head(movie_link2)
[1] "https://www.imdb.com/title/tt0068646/fullcredits/?ref_=tt_ql_cl"
[2] "https://www.imdb.com/title/tt0099685/fullcredits/?ref_=tt_ql_cl"
[3] "https://www.imdb.com/title/tt0110912/fullcredits/?ref_=tt_ql_cl"
[4] "https://www.imdb.com/title/tt0114814/fullcredits/?ref_=tt_ql_cl"
[5] "https://www.imdb.com/title/tt0078788/fullcredits/?ref_=tt_ql_cl"
[6] "https://www.imdb.com/title/tt0117951/fullcredits/?ref_=tt_ql_cl"
> tail(movie_link2)
[1] "https://www.imdb.com/title/tt0144084/fullcredits/?ref_=tt_ql_cl"
[2] "https://www.imdb.com/title/tt0119654/fullcredits/?ref_=tt_ql_cl"
[3] "https://www.imdb.com/title/tt0477348/fullcredits/?ref_=tt_ql_cl"
[4] "https://www.imdb.com/title/tt0080339/fullcredits/?ref_=tt_ql_cl"
[5] "https://www.imdb.com/title/tt0469494/fullcredits/?ref_=tt_ql_cl"
[6] "https://www.imdb.com/title/tt1375666/fullcredits/?ref_=tt_ql_cl"
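To see what each piece contributes, here is the same idea applied to the single URL from the question, with the intermediate values kept in variables. This sketch uses base R's sub() with fixed = TRUE in place of str_replace so that it runs without stringr; the variable names are illustrative.

```r
url <- "https://www.imdb.com/title/tt1375666/?ref_=ttls_li_tt"

# Everything before the last "/" (the title page)
dir_part <- dirname(url)
# Everything after the last "/" (the query string)
base_part <- basename(url)

# Swap the referrer tag, treating the pattern as a literal string
new_base <- sub("ttls_li_tt", "tt_ql_cl", base_part, fixed = TRUE)

# Re-join with "fullcredits" inserted between the two pieces
credits_url <- file.path(dir_part, "fullcredits", new_base)
print(credits_url)
```

Because dirname(), basename(), sub() and file.path() are all vectorised, the same lines work unchanged on the whole movie_link vector.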

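Another way is to strip the query string (everything from the "?" onward) with gsub and append the new path and query: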
df1 = gsub("\\?.*", "", movie_link)
df = paste0(df1, 'fullcredits/?ref_=tt_ql_cl')
df
[1] "https://www.imdb.com/title/tt0068646/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0099685/fullcredits/?ref_=tt_ql_cl"
[3] "https://www.imdb.com/title/tt0110912/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0114814/fullcredits/?ref_=tt_ql_cl"
[5] "https://www.imdb.com/title/tt0078788/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0117951/fullcredits/?ref_=tt_ql_cl"
[7] "https://www.imdb.com/title/tt0137523/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0108052/fullcredits/?ref_=tt_ql_cl"
[9] "https://www.imdb.com/title/tt0118749/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0105236/fullcredits/?ref_=tt_ql_cl"
[11] "https://www.imdb.com/title/tt0111161/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0073195/fullcredits/?ref_=tt_ql_cl"
[13] "https://www.imdb.com/title/tt0075314/fullcredits/?ref_=tt_ql_cl" "https://www.imdb.com/title/tt0119488/fullcredits/?ref_=tt_ql_cl"

Sort list files by semester and year

I am trying to sort a list of files in a directory. I used different libraries, but all of them gave me the same result. Example:
myFiles <- paste0("Archivo_", c(1:2),"S",rep(c(2010:2015), each=2), ".txt")
# install.packages ('gtools')
library ('gtools')
mixedsort(myFiles)
sort(myFiles)
# install.packages("naturalsort")
library("naturalsort")
naturalsort(myFiles)
[1] "Archivo_1S2010.txt" "Archivo_1S2011.txt" "Archivo_1S2012.txt"
[4] "Archivo_1S2013.txt" "Archivo_1S2014.txt" "Archivo_1S2015.txt"
[7] "Archivo_2S2010.txt" "Archivo_2S2011.txt" "Archivo_2S2012.txt"
[10] "Archivo_2S2013.txt" "Archivo_2S2014.txt" "Archivo_2S2015.txt"
I would like to get:
myFiles
"Archivo_1S2010.txt" "Archivo_2S2010.txt" "Archivo_1S2011.txt"
"Archivo_2S2011.txt" "Archivo_1S2012.txt" "Archivo_2S2012.txt"
"Archivo_1S2013.txt" "Archivo_2S2013.txt" "Archivo_1S2014.txt"
"Archivo_2S2014.txt" "Archivo_1S2015.txt" "Archivo_2S2015.txt"

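One option is to build a helper column with the semester part removed, so that only the year remains, and sort on that. This relies on arrange() keeping tied rows in their original order, so the semesters must already be in order within each year.
library(dplyr)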
myFiles %>%
tibble(archivo=.) %>%
mutate(archivo_ref = gsub("_\\dS", "", archivo)) %>%
arrange(archivo_ref) %>%
select(archivo) %>%
unlist %>%
unname
[1] "Archivo_1S2010.txt" "Archivo_2S2010.txt" "Archivo_1S2011.txt" "Archivo_2S2011.txt"
[5] "Archivo_1S2012.txt" "Archivo_2S2012.txt" "Archivo_1S2013.txt" "Archivo_2S2013.txt"
[9] "Archivo_1S2014.txt" "Archivo_2S2014.txt" "Archivo_1S2015.txt" "Archivo_2S2015.txt"
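The answer above sorts on the year only and leans on the input order to break ties. A sketch that sorts on both keys explicitly, in base R, is shown below; it assumes every name follows the Archivo_<semester>S<year>.txt pattern from the question, and the input is deliberately reversed to show that the result does not depend on the starting order.

```r
myFiles <- paste0("Archivo_", 1:2, "S", rep(2010:2015, each = 2), ".txt")
shuffled <- rev(myFiles)  # deliberately out of order

# Capture the semester digit and the four-digit year from each name
pattern <- "^Archivo_([0-9])S([0-9]{4})\\.txt$"
semester <- as.integer(sub(pattern, "\\1", shuffled))
year <- as.integer(sub(pattern, "\\2", shuffled))

# Order by year first, then by semester within each year
sorted <- shuffled[order(year, semester)]
print(sorted)
```

Names that do not match the pattern would be left unchanged by sub() and turn into NA on conversion, so a real directory listing may need filtering first.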

How to create and save subset dataframes for sequence of year-month

I would like to filter a data frame down to the observations for a given year-month, then save the result as a separate data frame named after the respective year-month.
I would be grateful if someone could suggest more efficient code than the one below. Also, this code is not filtering the observations correctly.
data <- data.frame(year = c(rep(2012,12),rep(2013,12),rep(2014,12),rep(2015,12),rep(2016,12)),
month = rep(1:12,5),
info = seq(60)*100)
years <- 2012:2016
months <- 1:12
for(year in years){
for(month in months){
data_sel <- data %>%
filter(year==year & month==month)
if(month<10){
month_alt <- paste0("0",month) # months 1-9 should show up as 01-09
}
Newname <- paste0(year,month_alt,'_','data_sel')
assign(Newname, data_sel)
}
}
The output I am looking to get is below (separate objects containing data from a given year-month):
> ls()
[1] "201201_data_sel" "201202_data_sel" "201203_data_sel" "201204_data_sel"
[5] "201205_data_sel" "201206_data_sel" "201207_data_sel" "201208_data_sel"
[9] "201209_data_sel" "201301_data_sel" "201302_data_sel" "201303_data_sel"
[13] "201304_data_sel" "201305_data_sel" "201306_data_sel" "201307_data_sel"
[17] "201308_data_sel" "201309_data_sel" "201401_data_sel" "201402_data_sel"
[21] "201403_data_sel" "201404_data_sel" "201405_data_sel" "201406_data_sel"
[25] "201407_data_sel" "201408_data_sel" "201409_data_sel" "201501_data_sel"
[29] "201502_data_sel" "201503_data_sel" "201504_data_sel" "201505_data_sel"
[33] "201506_data_sel" "201507_data_sel" "201508_data_sel" "201509_data_sel"
[37] "201601_data_sel" "201602_data_sel" "201603_data_sel" "201604_data_sel"
[41] "201605_data_sel" "201606_data_sel" "201607_data_sel" "201608_data_sel"
[45] "201609_data_sel" "data" "data_sel" "month"
[49] "month_alt" "months" "Newname" "year"
[53] "years"

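The loop does not filter correctly because, inside filter(), the names year and month refer to the data frame's columns, so year == year is always TRUE; the loop variables would need different names. In addition, month_alt is only updated for months below 10, so months 10 to 12 reuse the value "09". A more compact alternative is: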
library(dplyr)
g <- data %>%
mutate(month = sprintf("%02d", month)) %>%
group_by(year, month)
setNames(group_split(g), with(group_keys(g), paste0("data_sel_", year, month))) %>%
list2env(envir = .GlobalEnv)
A name that starts with a digit is not syntactically valid in R (such an object could only be accessed with backticks), so "data_sel_" comes first in paste0.
As mentioned in the comments, it might be better not to pipe to list2env and instead keep the output as a list with named elements.
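The named-list variant could be sketched in base R as follows. It rebuilds the example data from the question and uses split() on a key made with sprintf(); the names key and by_month are illustrative, and this is one possible way to do it rather than the method from the comments.

```r
# Same shape as the example data in the question
data <- data.frame(year = rep(2012:2016, each = 12),
                   month = rep(1:12, 5),
                   info = seq(60) * 100)

# One key per row, e.g. "data_sel_201203"; %02d zero-pads the month
key <- sprintf("data_sel_%d%02d", data$year, data$month)

# A named list with one data frame per year-month
by_month <- split(data, key)

print(length(by_month))
print(by_month[["data_sel_201310"]])
```

Keeping the pieces in a list makes it straightforward to loop over them with lapply() and avoids filling the global environment with sixty objects.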

Joined columns not visible in final result using dplyr

I am a newbie in R. I have the following code for doing some aggregations on the MovieLens dataset using dplyr:
joined_data <- inner_join(ratings_data,movie_data,by="movie_id",copy=TRUE)
data <- joined_data %>% group_by(movie_id) %>% arrange(movie_id)
data1 <- data %>% select(movie_id,movie_title,rating) %>% summarize(count_ratings=n())
The data tbl has all the columns I want (movie_id, movie_title, rating, ...). I'm trying to select only 3 columns and summarize them, but the data1 tbl does not have movie_title, which came from the second table (movie_data). Why is this happening? How do I get the columns I want in data1?
names(data)
[1] "user_id" "movie_id" "rating" "timestamp" "movie_title"
[6] "release_date" "video_release.date" "IMDb_URL" "unknown" "Action"
[11] "Adventure" "Animation" "Childrens" "Comedy" "Crime"
[16] "Documentary" "Drama" "Fantasy" "Film_Noir" "Horror"
[21] "Musical" "Mystery" "Romance" "Sci_Fi" "Thriller"
[26] "War" "Western"
But when I do this, movie_title is still missing:
data1 <- data %>% select(movie_id,movie_title,user_id,rating) %>% summarize(count_users=n(),count_ratings=n())
names(data1)
[1] "movie_id" "count_users" "count_ratings"

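The group_by(movie_id) in your second line is responsible for that: summarize() keeps only the grouping columns and the newly computed summaries, so movie_title is dropped. Try grouping by both columns:
group_by(movie_id, movie_title)
This worked, as suggested by @AntoniosK.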
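A small self-contained illustration of the difference, using a toy data frame with made-up values in place of the joined MovieLens data (the name joined and its contents are assumptions for the example). The use of n_distinct(user_id) for the user count is also a choice made here; the question used n() for both columns.

```r
library(dplyr)

# Toy stand-in for the joined data; the values are made up
joined <- data.frame(
  movie_id    = c(1, 1, 2, 2, 2),
  movie_title = c("A", "A", "B", "B", "B"),
  user_id     = c(10, 11, 10, 12, 13),
  rating      = c(4, 5, 3, 2, 4)
)

# Grouping by movie_id alone: movie_title does not survive summarize()
by_id <- joined %>%
  group_by(movie_id) %>%
  summarize(count_ratings = n())

# Grouping by both columns keeps the title in the result
counts <- joined %>%
  group_by(movie_id, movie_title) %>%
  summarize(count_users = n_distinct(user_id),
            count_ratings = n(),
            .groups = "drop")

print(names(by_id))
print(counts)
```

This works cleanly because each movie_id maps to exactly one movie_title; if titles varied within an id, grouping by both would split that id into several rows.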
