Webscraping using R - Table content

I am new to web scraping and am trying to scrape specific data from websites.
For example: https://www.vesselfinder.com/vessels/KOTA-CARUM-IMO-9494577-MMSI-563150100
I need to scrape the distance the ship has travelled in 2020 and 2021.
shipws <- read_html(shipsite)
The above code gets me the site; shipsite is the URL.
Next, I tried:
a <- shipws %>%
  html_nodes( css = "_1hFrZ") %>%
  html_attr()
But it returns an empty result. _1hFrZ is the td class on the website. It also returns an empty result when I use html_text().
a <- shipsite %>%
  html() %>%
  html_nodes(xpath='//*[@id="tbc1"]/div[1]/div[1]/table') %>%
  html_table()
A few tutorials suggested the approach above, but it fails with an error saying that the html() function does not exist. If I remove html()
Would love to know where I am going wrong. Thank you.

We can get all the tables from the website with:
library(rvest)
df = 'https://www.vesselfinder.com/vessels/KOTA-CARUM-IMO-9494577-MMSI-563150100' %>%
  read_html() %>% html_table()
The table of interest is the second one:
df[[2]]
# A tibble: 4 x 2
  X1                          X2
  <chr>                    <int>
1 Travelled distance (nm)  98985
2 Port Calls                  54
3 Average / Max Speed (kn)    NA
4 Min / Max Draught (m)       NA
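As for why the attempts in the question returned nothing: a CSS class selector needs a leading dot (._1hFrZ rather than _1hFrZ), and html() is no longer available in current rvest, where read_html() takes its place. The following is a minimal offline sketch of both points, not a tested scraper for the live site. The inline HTML is a made-up stand-in for the page (its structure is an assumption, not the actual vesselfinder.com markup), and minimal_html() and html_elements() assume rvest 1.0 or later.

```r
library(rvest)

# Hypothetical stand-in for the real page: two tables, the second one
# holding the statistics, with the class name from the question.
page <- minimal_html('
  <table><tr><td>Other</td><td>1</td></tr></table>
  <table>
    <tr><td class="_1hFrZ">Travelled distance (nm)</td><td>98985</td></tr>
    <tr><td class="_1hFrZ">Port Calls</td><td>54</td></tr>
  </table>')

# html_table() on the whole document returns one tibble per <table>
tables <- page %>% html_table()
stats <- tables[[2]]

# Pick the value by its row label instead of by position
distance <- stats$X2[stats$X1 == "Travelled distance (nm)"]

# A class selector must start with a dot
labels <- page %>% html_elements("._1hFrZ") %>% html_text()
```

Looking the value up by its label keeps the code working if the rows of the table are reordered, although it still depends on the table being the second one on the page.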

Related

R Web scraping data from StockTwits website

I want to get some information from tweets posted on the platform StockTwits. Here you can see an example tweet: https://stocktwits.com/3726859/message/469518468
I have already asked the same question once (R How to web scrape data from StockTwits with RSelenium?), but the StockTwits website has been changed and I can no longer work with the same html_nodes() command.
I would therefore be very happy if someone could help me with the input in the html_nodes() function.
I would like to read the following information: Number of replies, number of reshares, number of likes:
This is what I have so far:
library(rvest)
read_html("https://stocktwits.com/SunAndStorm/message/499613811") |>
html_nodes()
The final result should be a dataframe, which should look like this:
# A tibble: 1 × 5
  Reply Reshare Like  Share Search
  <lgl> <lgl>   <lgl> <lgl> <lgl>
      5       0     1     0      0
Look at the network section of your browser's developer tools and you will find their API. Call it with the ID of the tweet you are interested in.
I put together a starting point for you here. I couldn't find reshares and search, but I am sure they are in there somewhere. Since you have thousands of tweets to gather information on, this method is more efficient.
library(tidyverse)
library(httr2)
get_stockwits <- function(id) {
  data <-
    str_c("https://api.stocktwits.com/api/2/messages/", id, "/conversation.json?limit=21") %>%
    request() %>%
    req_perform() %>%
    resp_body_json(simplifyVector = TRUE)
  tibble(
    tweet = data %>%
      getElement("message") %>%
      getElement("body"),
    reply = data %>%
      getElement("message") %>%
      getElement("conversation") %>%
      getElement("replies"),
    likes = data %>%
      getElement("message") %>%
      getElement("likes") %>%
      getElement("total"),
    comments = data %>%
      getElement("children") %>%
      getElement("messages") %>%
      getElement("body")
  ) %>%
    nest(comments = comments)
}
get_stockwits(469518468)
# A tibble: 1 x 4
  tweet                             reply likes comments
  <chr>                             <int> <int> <list>
1 $GME going back in all this month     5     1 <tibble [2 x 1]>
Unnest comments to see the comments:
get_stockwits(469518468) %>%
  unnest(comments)
# A tibble: 2 x 4
  tweet                             reply likes comments
  <chr>                             <int> <int> <chr>
1 $GME going back in all this month     5     1 #okkenny yeah with options
2 $GME going back in all this month     5     1 #okkenny playing monthly only
I do not use html_nodes(); instead, I locate the element by its XPath with RSelenium. The following code gives you the information you need:
library(RSelenium)
url <- "https://stocktwits.com/SunAndStorm/message/499613811"
# Set up driver
driver <- rsDriver(browser = "firefox", chromever = NULL)
remDr <- driver[["client"]]
# Go to site
remDr$navigate(url)
# Extract information using xpath
info <- remDr$findElement(using = "xpath", "/html/body/div[2]/div/div[2]/div[2]/div[2]/div/div/div/div[1]/div[1]/div/div[2]/article/div/div[5]")
Then you can use getElementText() to read the information:
> info$getElementText()
[[1]]
[1] "4Comments\n0Reshares\n7Likes"
If you need help converting this string to a dataframe let me know and I can help you out, but I assume this is not the main problem.
Kind regards
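For completeness, converting that string into a data frame could look roughly like the base R sketch below. It assumes the text always has the form shown above (a count immediately followed by a label, one pair per line); the variable names are purely illustrative.

```r
# The string returned by getElementText() in the example above
txt <- "4Comments\n0Reshares\n7Likes"

# One element per line: "4Comments", "0Reshares", "7Likes"
parts <- strsplit(txt, "\n", fixed = TRUE)[[1]]

# Leading digits are the count, the remainder is the label
counts <- as.integer(sub("^([0-9]+).*$", "\\1", parts))
labels <- sub("^[0-9]+", "", parts)

# One row, one column per label
engagement <- as.data.frame(as.list(setNames(counts, labels)))
```

If the page ever formats the counts differently (for example "1.2k"), the as.integer() step would yield NA, so the pattern would need adjusting.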

counting word frequency in a string across columns in R

I am trying to count how many times each word appears in total, across every row of a column, for my whole data set. The data can be found here: https://www.kaggle.com/tovarischsukhov/southparklines
My code is as follows:
SP = read.csv("All-seasons.csv")
SP$Season = as.numeric(SP$Season)
SP$Episode = as.numeric(SP$Episode)
Cartman = SP %>% group_by(Character) %>%
  arrange(Season, Episode) %>%
  filter(Character =="Cartman")
Cartman_text_tbl <- as_tibble(data.frame(uniqueID = 1:length(Cartman$Season),Cartman[1:length(Cartman$Season),]))
Cartman_text_tbl_words <- Cartman_text_tbl %>% select(uniqueID,Cartman$Line) %>%
  unnest_tokens(word, Cartman$Line) %>% filter(str_detect(word,"^[a-z']+$")) %>%
  group_by(uniqueID) %>% count(word)
When I run the last line of code I get this error:
Error in `select()`:
! Can't subset columns that don't exist.
x Columns `Yeah, go home you little dildo.\n`, `I know what it means!\n`, `I'm not telling you.\n`, `He-yeah, that's what Kyle's little brother is all right! Ow! \n`, `That's 'cause I was having these... bogus nightmares.\n`, etc. don't exist.
A couple of years ago I did a class project for which the professor provided some similar code, and I am trying to model this code on it. If there is a better way to get the counts, that would be great to know; otherwise, a fix for the error would be enough. Additionally, each line ends with a "\n". Is it possible to remove those from every line? Thanks!
If I understand you correctly, I believe this may help you. The output gives you the count of each word said by Cartman for each episode and season. Of course for other characters you can use the same code and change the filter and object the output is assigned to. Also if you need to remove stop words you can add anti_join(stop_words, by = "word") %>% after the unnest_tokens() function. It is also set as sort = TRUE, so it will sort the words in descending order based on frequency, so you can change this and sort as needed.
Code:
library(tidyverse)
library(tidytext)
df <- read_csv("All-seasons.csv")
cartman <- df %>%
  filter(Character == "Cartman") %>%
  group_by(Season, Episode) %>%
  unnest_tokens(output = word, input = Line) %>%
  count(word, sort = TRUE)
Output Example:
> head(cartman)
# A tibble: 6 x 4
# Groups:   Season, Episode [6]
  Season Episode word      n
   <dbl>   <dbl> <chr> <int>
1      7      11 you      73
2     11       8 i        73
3      5       4 you      66
4     16       7 you      63
5     14       8 i        61
6     11       2 i        60
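The error in the question comes from passing Cartman$Line (the column's values) to select() and unnest_tokens(), which both expect the bare column name Line. The sketch below illustrates this on a tiny made-up data set, so the rows are assumptions and not taken from the Kaggle file. It also trims the trailing "\n" with str_trim(), although unnest_tokens() drops whitespace during tokenization anyway.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Made-up rows with the same column names as All-seasons.csv
lines_tbl <- tibble(
  Season    = c(1, 1, 1),
  Episode   = c(1, 1, 2),
  Character = c("Cartman", "Kyle", "Cartman"),
  Line      = c("I know what it means!\n", "No you don't.\n", "I know, I know.\n")
)

word_counts <- lines_tbl %>%
  filter(Character == "Cartman") %>%
  mutate(Line = str_trim(Line)) %>%   # remove the trailing newline
  unnest_tokens(word, Line) %>%       # bare column name, not lines_tbl$Line
  count(Season, Episode, word, sort = TRUE)
```

Using count(Season, Episode, word) gives the same grouping as group_by() followed by count(word), but returns an ungrouped tibble.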

Is there a function similar to read_html() that can be used on data table or data frame types in R?

I'm attempting to scrape footballdb.com for data on NFL player injuries, for a model I am building, from links such as this one: https://www.footballdb.com/transactions/injuries.html?yr=2016&wk=1&type=reg. The results will then be written to a data table. Along with each player's injury information (i.e. their name, injury, and status throughout the week leading up to the game), I also want to include the season and week of the injury. I started by using nested for loops to generate the URL for each webpage, along with the season and week corresponding to it, and stored them in a data table with the columns link, season, and week.
I then tried to use map_df(), read_html(), and html_nodes() to extract the information I wanted from each webpage, but I ran into errors because read_html() does not work on objects of the data table or data frame class. I also tried different types of indexing and the $ operator, with no luck. Is there any way I can modify the code I have so far to extract the information I want from a data table? Below is what I have written:
library(purrr)
library(rvest)
library(data.table)
#Remove file if file already exists
if (file.exists("./project/volume/data/interim/injuryreports.csv")) {
  file.remove("./project/volume/data/interim/injuryreports.csv")
}
#Declare variables and empty data tables
path1<-("https://www.footballdb.com/transactions/injuries.html?yr=")
seasons<-c("2016", "2017", "2020")
weeks<-1:17
result<-data.table()
temp<-NULL
#Use nested for loops to get the url, season, and week for each webpage of interest, store in result data table
for(s in 1:length(seasons)){
  for(w in 1:length(weeks)){
    temp$link<- paste0(path1, seasons[s],"&wk=", as.character(w), "&type=reg")
    temp$season<-as.numeric(seasons[s])
    temp$week<-weeks[w]
    result<-rbind(result,temp)
  }
}
#Get rid of any potential empty values from result
result<-compact(result)
###Errors Below####
DT <- map_df(result, function(x){
  page <- read_html(x[[1]])
  data.table(
    Season = x[[2]],
    Week = x[[3]],
    Player = page %>% html_nodes('.divtable .td:nth-child(1) b') %>% html_text(),
    Injury = page %>% html_nodes('.divtable .td:nth-child(2)') %>% html_text(),
    Wed = page %>% html_nodes('.divtable .td:nth-child(3)') %>% html_text(),
    Thu = page %>% html_nodes('.divtable .td:nth-child(4)') %>% html_text(),
    Fri = page %>% html_nodes('.divtable .td:nth-child(5)') %>% html_text(),
    GameStatus = page %>% html_nodes('.divtable .td:nth-child(6)') %>% html_text()
  )
}
)
#####End of Errors###
#Write out injury data table
fwrite(DT,"./project/volume/data/interim/injuryreports.csv")
The issue is that your input, result, is a data.table. When you pass it to map_df, it loops over the columns(!!) of the data.table, not the rows.
One approach to make your code work is to split result by link and loop over the resulting list.
Note: For the reprex I only loop over the first two elements of the list. Additionally, I moved your function outside of the map statement, which made debugging easier.
library(purrr)
library(rvest)
library(data.table)
#Declare variables and empty data tables
path1<-("https://www.footballdb.com/transactions/injuries.html?yr=")
seasons<-c("2016", "2017", "2020")
weeks<-1:17
result<-data.table()
temp<-NULL
#Use nested for loops to get the url, season, and week for each webpage of interest, store in result data table
for(s in 1:length(seasons)){
  for(w in 1:length(weeks)){
    temp$link<- paste0(path1, seasons[s],"&wk=", as.character(w), "&type=reg")
    temp$season<-as.numeric(seasons[s])
    temp$week<-weeks[w]
    result<-rbind(result,temp)
  }
}
#Get rid of any potential empty values from result
result<-compact(result)
result <- split(result, result$link)
get_table <- function(x) {
  page <- read_html(x[[1]])
  data.table(
    Season = x[[2]],
    Week = x[[3]],
    Player = page %>% html_nodes('.divtable .td:nth-child(1) b') %>% html_text(),
    Injury = page %>% html_nodes('.divtable .td:nth-child(2)') %>% html_text(),
    Wed = page %>% html_nodes('.divtable .td:nth-child(3)') %>% html_text(),
    Thu = page %>% html_nodes('.divtable .td:nth-child(4)') %>% html_text(),
    Fri = page %>% html_nodes('.divtable .td:nth-child(5)') %>% html_text(),
    GameStatus = page %>% html_nodes('.divtable .td:nth-child(6)') %>% html_text()
  )
}
DT <- map_df(result[1:2], get_table)
DT
#>      Season Week          Player     Injury     Wed     Thu     Fri
#>   1:   2016    1   Justin Bethel       Foot Limited Limited Limited
#>   2:   2016    1     Lamar Louis       Knee     DNP Limited Limited
#>   3:   2016    1   Kareem Martin       Knee     DNP     DNP     DNP
#>   4:   2016    1     Alex Okafor     Biceps    Full    Full    Full
#>   5:   2016    1  Frostee Rucker       Neck Limited Limited    Full
#>  ---
#> 437:   2016   10   Will Blackmon      Thumb Limited Limited Limited
#> 438:   2016   10   Duke Ihenacho Concussion    Full    Full    Full
#> 439:   2016   10  DeSean Jackson   Shoulder     DNP     DNP     DNP
#> 440:   2016   10    Morgan Moses      Ankle Limited Limited Limited
#> 441:   2016   10 Brandon Scherff   Shoulder    Full    Full    Full
#>                       GameStatus
#>   1:  (09/09) Questionable vs NE
#>   2:  (09/09) Questionable vs NE
#>   3:           (09/09) Out vs NE
#>   4:                          --
#>   5:                          --
#>  ---
#> 437: (11/11) Questionable vs Min
#> 438: (11/11) Questionable vs Min
#> 439:     (11/11) Doubtful vs Min
#> 440: (11/11) Questionable vs Min
#> 441:                          --
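The column-versus-row behaviour described in this answer can be checked without any network access. The sketch below uses a small made-up data frame of links (the example.com URLs are placeholders) and shows pmap() as one alternative to split() for iterating row by row; it is an illustration, not the answer's own method.

```r
library(purrr)

# Made-up stand-in for the result table built by the nested loops
urls <- data.frame(
  link   = c("https://example.com/a", "https://example.com/b"),
  season = c(2016, 2016),
  week   = c(1, 2),
  stringsAsFactors = FALSE
)

# map() treats a data frame as a list of columns: one call per column
n_col_calls <- length(map(urls, identity))

# pmap() calls the function once per row, matching arguments to column names
rows <- pmap(urls, function(link, season, week) {
  paste(season, week, link)
})
```

Here n_col_calls equals the number of columns, while rows has one element per row, which is the shape a per-page scraping function needs.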

Webscrape using for loop into data frame in R

I feel like I am close with this but cannot find the right solution. I want to scrape tables from multiple pages and save the results into one final data frame. All the tables will have the same structure. My code is below with a sample of the loop (realistically there are potentially 1,000 pages). When I run the code on a single page I can get the result but I cannot figure out the loop or how to save the loop results into a data frame. See what I am doing below, any help appreciated!!
library(textreadr)
library(dplyr)
library(rvest)
for (event in (803:806)){
  url<-paste0('http://profightdb.com/cards/wwf/monday-night-raw-', event,'.html')
  webpage<-read_html(url)
  tbls_ls<-webpage %>%
    html_nodes('table') %>%
    .[[2]] %>%
    html_table(fill=TRUE)
}
Perhaps save the results as a list of data frames.
library(textreadr)
library(dplyr)
library(rvest)
tbls_ls <- vector(4, mode="list") # Initialize the list
i <- 1 # Initialize the index
for (event in (803:806)){
  url <- paste0('http://profightdb.com/cards/wwf/monday-night-raw-', event,'.html')
  webpage <- read_html(url)
  tbls_ls[[i]] <- webpage %>%
    html_nodes('table') %>%
    .[[2]] %>%
    html_table(fill=TRUE)
  i <- i+1 # Update the index
}
class(tbls_ls) # "list"
names(tbls_ls) <- 803:806 # Name the elements
tbls_ls[1]
$`803`
  no.                        match      match                           match duration
1   1                     Yokozuna def. (pin)                     Koko B Ware    03:45
2   2 Rick Steiner & Scott Steiner def. (pin) Executioner #1 & Executioner #2    03:00
3   3           Shawn Michaels (c) def. (pin)                        Max Moon    10:30
4   4               The Undertaker def. (pin)                  Damien Demento    02:26
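Since the question asks for one final data frame, the named list can then be stacked. The following is a minimal sketch using dplyr::bind_rows() on made-up tables (the column names match_no and duration are illustrative, not the real profightdb.com columns); the .id argument turns the list names into an event column.

```r
library(dplyr)

# Made-up stand-ins for two scraped tables with the same structure
tbls_ls <- list(
  data.frame(match_no = 1:2, duration = c("03:45", "03:00")),
  data.frame(match_no = 1L, duration = "10:30")
)
names(tbls_ls) <- c("803", "804")   # event numbers as list names

# Stack the tables; the list names become the "event" column
all_events <- bind_rows(tbls_ls, .id = "event")
```

With real pages, repeated column names such as the three match columns above would probably need to be made unique first (for example with make.names(names(x), unique = TRUE)) before binding.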

How to scrape multiple tables that are without IDs or Class using R

I'm trying to scrape this webpage using R : http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all (All the pages)
I'm new to programming. And everywhere I've looked, tables are mostly identified with IDs or Divs or Class. On this page there's none. Data is stored in Table format. How should I scrape it?
This is what I did :
library(rvest)
webpage <- read_html("http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all")
tbls <- html_nodes(webpage, "table")
head(tbls)
tbls_ls <- webpage %>%
  html_nodes("table") %>%
  .[9:10] %>%
  html_table(fill = TRUE)
colnames(tbls_ls[[1]]) <- c("Mobile Make", "State", "District",
                            "Police Station", "Status", "Mobile Type(GSM/CDMA)",
                            "FIR/DD/GD Dat")
You can scrape the table data by targeting the CSS id of each table. It looks like each page is composed of 3 different tables placed one after another. Two of the tables have the #AutoNumber15 CSS id, while the third (in the middle) has the #AutoNumber16 CSS id.
I put together a simple code example that should get you started in the right direction.
suppressMessages(library(tidyverse))
suppressMessages(library(rvest))
# define function to scrape the table data from a page
get_page <- function(page_id = 1) {
  # default link
  link <- "http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all&Page_No="
  # build link
  link <- paste0(link, page_id)
  # get tables data
  wp <- read_html(link)
  wp %>%
    html_nodes("#AutoNumber16, #AutoNumber15") %>%
    html_table(fill = TRUE) %>%
    bind_rows()
}
# get the data from the first three pages
iter_page <- 1:3
# this is just a progress bar
pb <- progress_estimated(length(iter_page))
# this code will iterate over pages 1 through 3 and apply the get_page()
# function defined earlier. The Sys.sleep() part is used to pause the code
# after each iteration so that the server is not overloaded with requests.
map_df(iter_page, ~ {
  pb$tick()$print()
  df <- get_page(.x)
  Sys.sleep(sample(10, 1) * 0.1)
  as_tibble(df)
})
#> # A tibble: 72 x 4
#>    X1                    X2           X3
#>    <chr>                 <chr>        <chr>
#>  1 FIR/DD/GD Number      000165       State
#>  2 FIR/DD/GD Date        17/08/2017   District
#>  3 Mobile Type(GSM/CDMA) GSM          Police Station
#>  4 Mobile Make           SAMSUNG J2   Mobile Number
#>  5 Missing/Stolen Date   23/04/2017   IMEI Number
#>  6 Complainant           AKEEL KHAN   Complainant Contact Number
#>  7 Status                Stolen/Theft Report Date/Time on ZIPNET
#>  8 <NA>                  <NA>         <NA>
#>  9 FIR/DD/GD Number      FIR No 37/   State
#> 10 FIR/DD/GD Date        17/08/2017   District
#> # ... with 62 more rows, and 1 more variables: X4 <chr>
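The id-based selection can be tried offline on a small inline document. The sketch below is only an illustration: the HTML is a made-up stand-in for the zipnet.in markup, and minimal_html(), html_elements(), and the convert argument of html_table() assume rvest 1.0 or later. Passing convert = FALSE keeps every cell as text, which preserves leading zeros such as 000165 and avoids type clashes when the tables are stacked.

```r
library(rvest)
library(dplyr)

# Made-up stand-in page: two tables with ids and one layout table without
page <- minimal_html('
  <table id="AutoNumber15"><tr><td>FIR/DD/GD Number</td><td>000165</td></tr></table>
  <table id="AutoNumber16"><tr><td>Status</td><td>Stolen/Theft</td></tr></table>
  <table><tr><td>layout</td><td>ignored</td></tr></table>')

records <- page %>%
  html_elements("#AutoNumber15, #AutoNumber16") %>%  # select tables by id only
  html_table(convert = FALSE) %>%                    # keep all cells as character
  bind_rows()
```

The table without an id is skipped, so records contains only the rows from the two targeted tables.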
