Webscrape using for loop into data frame in R

I feel like I am close with this but cannot find the right solution. I want to scrape tables from multiple pages and save the results into one final data frame. All the tables will have the same structure. My code is below with a sample of the loop (realistically there are potentially 1,000 pages). When I run the code on a single page I can get the result but I cannot figure out the loop or how to save the loop results into a data frame. See what I am doing below, any help appreciated!!
library(textreadr)
library(dplyr)
library(rvest)
for (event in (803:806)){
url<-paste0('http://profightdb.com/cards/wwf/monday-night-raw-', event,'.html')
webpage<-read_html(url)
tbls_ls<-webpage %>%
html_nodes('table') %>%
.[[2]] %>%
html_table(fill=TRUE)
}

Perhaps save the results as a list of data frames.
library(textreadr)
library(dplyr)
library(rvest)
tbls_ls <- vector("list", 4) # Initialize the list with one slot per page
i <- 1 # Initialize the index
for (event in (803:806)){
url <- paste0('http://profightdb.com/cards/wwf/monday-night-raw-', event,'.html')
webpage <- read_html(url)
tbls_ls[[i]] <- webpage %>%
html_nodes('table') %>%
.[[2]] %>%
html_table(fill=TRUE)
i <- i+1 # Update the index
}
class(tbls_ls) # "list"
names(tbls_ls) <- 803:806 # Name the elements
tbls_ls[1]
$`803`
no. match match match duration
1 1 Yokozuna def. (pin) Koko B Ware 03:45
2 2 Rick Steiner & Scott Steiner def. (pin) Executioner #1 & Executioner #2 03:00
3 3 Shawn Michaels (c) def. (pin) Max Moon 10:30
4 4 The Undertaker def. (pin) Damien Demento 02:26
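Since the question asks for one final data frame, the list can be stacked once the loop finishes. A minimal sketch in base R, assuming every table has the same columns; the two small data frames below are made-up stand-ins for the scraped tables, and the names tagged and all_events are only illustrative:

```r
# Stand-ins for the scraped tables (hypothetical values, not real match data).
tbls_ls <- list(
  `803` = data.frame(no = 1:2, duration = c("03:45", "03:00")),
  `804` = data.frame(no = 1:3, duration = c("05:10", "02:26", "10:30"))
)

# Tag each table with the event number it came from, so the origin of each
# row is still known after stacking.
tagged <- Map(function(tbl, event) cbind(event = event, tbl),
              tbls_ls, names(tbls_ls))

# Stack all tables into a single data frame.
all_events <- do.call(rbind, c(unname(tagged), make.row.names = FALSE))
all_events
```

With dplyr loaded, bind_rows(tbls_ls, .id = "event") should give a similar result in one call.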

Related

Webscraping using R - Table content

New to webscraping. I am trying to scrape specific data from websites.
For example: https://www.vesselfinder.com/vessels/KOTA-CARUM-IMO-9494577-MMSI-563150100
I need to scrape the distance the ship has travelled in 2020 and 2021.
shipws <- read_html(shipsite)
The above code gets me the site. shipsite is the url.
Now, I tried using,
a <- shipws %>%
html_nodes( css = "_1hFrZ") %>%
html_attr()
But it returns an empty result. _1hFrZ was the td class on the website. It also returns empty when I use html_text().
a <- shipsite %>%
html() %>%
html_nodes(xpath='//*[@id="tbc1"]/div[1]/div[1]/table') %>%
html_table()
A few tutorials suggested the approach above, but it fails with an error saying that the html() function does not exist. If I remove html()
I would love to know where I am going wrong. Thank you.
We can just get all the tables from website by,
df = 'https://www.vesselfinder.com/vessels/KOTA-CARUM-IMO-9494577-MMSI-563150100' %>%
read_html() %>% html_table()
The table of interest is,
df[[2]]
# A tibble: 4 x 2
X1 X2
<chr> <int>
1 Travelled distance (nm) 98985
2 Port Calls 54
3 Average / Max Speed (kn) NA
4 Min / Max Draught (m) NA
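To pull a single value out of a two-column label/value table like this one, the row can be looked up by its label. A small sketch, assuming the table has the X1/X2 layout shown above; the data frame here is typed in by hand from that printed output rather than scraped, and the names vessel_tbl and distance_nm are only illustrative:

```r
# Hand-typed copy of the first two rows of the printed table (not scraped here).
vessel_tbl <- data.frame(
  X1 = c("Travelled distance (nm)", "Port Calls"),
  X2 = c(98985L, 54L)
)

# Look up the value by its label instead of relying on the row position,
# which may change if the page layout changes.
distance_nm <- vessel_tbl$X2[vessel_tbl$X1 == "Travelled distance (nm)"]
distance_nm
```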

R for loop to extract info from a file and add it into tibble?

I am not great with tidyverse so forgive me if this is a simple question. I have a bunch of files with data that I need to extract and add into distinct columns in a tibble I created.
I want the row names to start with the file IDs, which I did manage to create:
filelist <- list.files(pattern=".txt") # Gives me the filenames in current directory.
# The filenames are something like AA1230.report.txt for example
file_ID <- trimws(filelist, whitespace="\\..*") # Gives me the ID which is before the "report.txt"
metadata <- as_tibble(file_ID[1:181]) # create dataframe with IDs as row names for 180 files.
Now in these report files are information on species and abundance (kraken report files for those familiar with kraken) and all I need is to extract the number of reads for each domain. I can easily search up in each file the domains and number of reads that fall into that domain using something like:
sample_data <- as_tibble(read.table("AA1230.report.txt", sep="\t", header=FALSE, strip.white=TRUE))
sample_data <- rename(sample_data, Percentage=V1, Num_reads_root=V2, Num_reads_taxon=V3, Rank=V4, NCBI_ID=V5, Name=V6) # Just renaming the column headers for clarity
sample_data %>% filter(Rank=="D") # D for domain
This gives me a clear output such as:
Percentage Num_Reads_Root Num_Reads_Taxon Rank NCBI_ID Name
<dbl> <int> <int> <fct> <int> <fct>
1 75.9 60533 28 D 2 Bacteria
2 0.48 386 0 D 2759 Eukaryota
3 0.01 4 0 D 2157 Archaea
4 0.02 19 0 D 10239 Viruses
Now, I want to just grab the info in the second column and final column and save this info into my tibble so that I can get something like:
> metadata
value Bacteria_Counts Eukaryota_Counts Viruses_Counts Archaea_Counts
<chr> <int> <int> <int> <int>
1 AA1230 60533 386 19 4
2 AB0566
3 AA1231
4 AB0567
5 BC1148
6 AW0001
7 AW0002
8 BB1121
9 BC0001
10 BC0002
....with 171 more rows
I'm just having trouble coming up with a for loop to create these sample_data outputs, then from that, extract the info and place into a tibble. I guess my first loop should create these sample_data outputs so something like:
for (file in filelist) {
>> get_domains <<
}
Then another loop to extract that info from the above loop and insert it into my metadata tibble.
Any suggestions? Thank you so much!
PS: If regular dataframes in R is better for this let me know, I have just recently learned that tidyverse is a better way to organize dataframes in R but I have to learn more about it.
You could also do:
library(tidyverse)
filelist <- list.files(pattern=".txt")
nms <- c("Percentage", "Num_reads_root", "Num_reads_taxon", "Rank", "NCBI_ID", "Name")
set_names(filelist,filelist) %>%
map_dfr(read_table, col_names = nms, .id = 'file_ID') %>%
filter(Rank == 'D') %>%
select(file_ID, Name, Num_reads_root) %>%
pivot_wider(id_cols = file_ID, names_from = Name, values_from = Num_reads_root) %>%
mutate(file_ID = str_remove(file_ID, '.txt'))
I've found that using a for loop is nice sometimes because it saves all the progress along the way in case you hit an error. Then you can find the problem file and debug it, or wrap the read in try() and throw a warning() instead.
library(tidyverse)
filelist <- list.files(pattern=".txt") #list files
tmp_list <- list()
for (i in seq_along(filelist)) {
my_table <- read_tsv(filelist[i], # It looks like your files are all .tsv's with no header row
col_names = c("Percentage", "Num_reads_root", "Num_reads_taxon", "Rank", "NCBI_ID", "Name")) %>%
filter(Rank=="D") %>%
mutate(file_ID = trimws(filelist[i], whitespace="\\..*")) %>% # use = (not <-) so the new column is named file_ID
select(file_ID, everything())
tmp_list[[i]] <- my_table
}
out <- bind_rows(tmp_list)
out
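The try()/warning() idea mentioned above could look roughly like this. It is only a sketch in base R using tryCatch(): the three files are throwaway stand-ins written to a temporary directory (one is deliberately empty so that reading it fails), and report_dir, res and id are illustrative names:

```r
# Write stand-in report files to a temporary directory; the middle one is empty.
report_dir <- tempfile("reports")
dir.create(report_dir)
writeLines("75.9\t60533\t28\tD\t2\tBacteria", file.path(report_dir, "AA1230.report.txt"))
file.create(file.path(report_dir, "AB0566.report.txt"))
writeLines("0.48\t386\t0\tD\t2759\tEukaryota", file.path(report_dir, "BC1148.report.txt"))
filelist <- list.files(report_dir, pattern = "\\.txt$", full.names = TRUE)

tmp_list <- list()
for (path in filelist) {
  id <- sub("\\..*$", "", basename(path)) # part of the file name before the first dot
  res <- tryCatch(
    read.delim(path, header = FALSE, strip.white = TRUE),
    error = function(e) {
      # Report the problem file but keep looping over the rest.
      warning("Skipping ", id, ": ", conditionMessage(e), call. = FALSE)
      NULL
    }
  )
  if (!is.null(res)) tmp_list[[id]] <- cbind(file_ID = id, res)
}

# Only the files that were read successfully end up in the result.
out <- do.call(rbind, c(unname(tmp_list), make.row.names = FALSE))
out
```

The files that failed are named in the warnings, so they can be inspected afterwards without rerunning the whole loop.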

Sentiment Analysis By Date

I'm doing some very basic sentiment analysis on a pretty large set of data that continues to grow every day. I need to feed this data into a shiny app where I can adjust the date range. Rather than running the analysis over and over again, what I'd like to do is create a new CSV with the sum of each sentiment score by date. I'm having trouble iterating over the date though. Here's some sample data and the lapply() statement I tried that is not working.
library(tidyverse)
library(syuzhet)
library(data.table)
df <- data.frame(date = c("2021-01-18", "2021-01-18", "2021-01-18", "2021-01-17","2021-01-17", "2021-01-16", "2021-01-15", "2021-01-15", "2021-01-15"),
text = c("Some text here", "More text", "Some other words", "Just making this up", "as I go along", "hope the example helps", "thank you in advance", "I appreciate the help", "the end"))
> df
date text
1 2021-01-18 Some text here
2 2021-01-18 More text
3 2021-01-18 Some other words
4 2021-01-17 Just making this up
5 2021-01-17 as I go along
6 2021-01-16 hope the example helps
7 2021-01-15 thank you in advance
8 2021-01-15 I appreciate the help
9 2021-01-15 the end
dates_scores_df <- lapply(df, function(i){
data <- df %>%
# Filter to the unique date
filter(date == unique(df$date[i]))
# Sentiment Analysis for each date
sentiment_data <- get_nrc_sentiment(df$text)
# Convert to df
score_df <- data.frame(sentiment_data[,])
# Transpose the data frame and adjust column names
daily_sentiment_data <- transpose(score_df)
colnames(daily_sentiment_data) <- rownames(score_df)
# Add a date column
daily_sentiment_data$date <- df$date[i]
})
sentiment_scores_by_date <- do.call("rbind.data.frame", dates_scores_df)
What I'd like to get to is something like this (data here is made up and will not match the example above)
date anger anticipation disgust fear joy sadness surprise trust negative positive
2021-01-18 1 2 0 1 2 0 2 1 1 2
2021-01-17 1 2 0 2 3 3 1 2 0 1
You can try :
library(dplyr)
library(purrr)
library(syuzhet)
df %>%
split(.$date) %>%
imap_dfr(~get_nrc_sentiment(.x$text) %>%
summarise(across(.fns = sum)) %>%
mutate(date = .y, .before = 1)) -> result
result
lapply iterates over the elements of a list. A data frame is technically a list with each column as an element, so in your example you are iterating over columns rather than rows or dates (which seems to be your goal). Instead of lapply I'd use dplyr::group_by in combination with one of dplyr::do, dplyr::summarize, or tidyr::nest. See the documentation for each function to figure out which one best suits your needs.
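A quick way to see the column-versus-date difference, sketched in base R on a cut-down version of the sample data (the names visited, by_date and rows_per_date are only illustrative):

```r
df <- data.frame(
  date = c("2021-01-18", "2021-01-18", "2021-01-17"),
  text = c("Some text here", "More text", "Just making this up")
)

# lapply() over a data frame visits each column, so the function runs
# once for "date" and once for "text", never once per date.
visited <- lapply(df, function(col) length(col))
names(visited)

# Splitting by date first gives one list element per date, which is the
# unit the per-date analysis actually needs.
by_date <- split(df, df$date)
rows_per_date <- sapply(by_date, nrow)
rows_per_date
```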

How to do same function on every file in a folder in R?

So I have a folder of identically formatted CSVs. Let's call the folder "Folder" and the CSVs:
test1.csv
test2.csv
test3.csv
......
Each csv is formatted as follows
ID date hours info
001 01/01/2019 8 xxxx
002 01/01/2019 22 xxxx
003 01/02/2019 4 xxxx
004 01/02/2019 5 xxxx
The following works for a single file, but how could I run it across all files in the folder and combine the results?
totals <- df %>%
group_by(date) %>%
summarize(hour_sum = sum(hours))
So basically I want to have a dataframe which has every date in all files and the sum of the hours from ALL files.
So if 01/02/2019 appears in 3 files, I want the sum of hours for every occurence of that date in one df.
If you are willing to use the whole tidyverse set of packages, purrr gives you map_dfr, which returns a single dataframe by rbinding each dataset you read in.
The code would look something like this:
library(tidyverse)
list.files(path = "path_to_data", full.names = TRUE) %>%
map_dfr(read.csv) %>%
group_by(date) %>%
summarize(hour_sum = sum(hours))
Maybe you could try the code below
aggregate(
hours ~ date,
do.call(rbind, c(lapply(list.files(pattern = "test\\d+\\.csv"), read.csv), make.row.names = FALSE)),
sum
)
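Either approach can be tried without touching real data. Here is a self-contained sketch of the base R variant, assuming two throwaway CSVs written to a temporary folder; the hours values are made up and the names folder, combined and totals are only illustrative:

```r
# Create a temporary "Folder" with two small CSVs in the format described above.
folder <- tempfile("Folder")
dir.create(folder)
write.csv(data.frame(date = c("01/01/2019", "01/02/2019"), hours = c(8, 4)),
          file.path(folder, "test1.csv"), row.names = FALSE)
write.csv(data.frame(date = c("01/02/2019", "01/02/2019"), hours = c(5, 1)),
          file.path(folder, "test2.csv"), row.names = FALSE)

# Read every CSV in the folder and stack the rows into one data frame.
files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
combined <- do.call(rbind, lapply(files, read.csv))

# Sum the hours per date across all files.
totals <- aggregate(hours ~ date, data = combined, FUN = sum)
totals
```

Here 01/02/2019 appears in both files, so its total adds up the hours from each of them.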

R scrape html table and extract background color

I am trying to scrape some data off a wikipedia table from this page:
https://en.wikipedia.org/wiki/Results_of_the_Indian_general_election,_2014 and I am interested in the table:
Summary of the 2014 Indian general election
I would also like to extract the party colors from the table.
Here's what I've tried so far:
library("rvest")
url <-
"https://en.wikipedia.org/wiki/Results_of_the_Indian_general_election,_2014"
electionstats <- read_html(url)
results <- html_nodes(electionstats, xpath='//*[@id="mw-content-text"]/div/table[79]') %>% html_table(fill = T)
party_colors <- electionstats %>%
html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>%
html_table(fill = T)
Printing out party_colors does not show any info about the colors
So, I tried:
party_colors <- electionstats %>% html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>%
html_nodes('tr')
Now if I print out party_colors, I get:
[1] <tr style="background-color:#E9E9E9">\n<th style="text-align:left;vertical-align:bottom;" rowspan="2"></th>\n<th style="text-align:left; ...
[2] <tr style="background-color:#E9E9E9">\n<th style="text-align:center;">No.</th>\n<th style="text-align:center;">+/-</th>\n<th style="text ...
[3] <tr>\n<td style="background-color:#FF9933"></td>\n<td style="text-align:left;"><a href="/wiki/Bharatiya_Janata_Party" title="Bharatiya J ...
[4] <tr>\n<td style="background-color:#00BFFF"></td>\n<td style="text-align:left;"><a href="/wiki/Indian_National_Congress" title="Indian Na ...
[5] <tr>\n<td style="background-color:#009900"></td>\n<td style="text-align:left;"><a href="/wiki/All_India_Anna_Dravida_Munnetra_Kazhagam" ...
and so on...
But, now, I have no idea how to pull out the colors from this data. I cannot convert the output of the above to a html_table with:
html_table(fill = T)
I get the error:
Error: html_name(x) == "table" is not TRUE
I also tried various options with html_attrs, but I have no idea what the correct attribute I should be passing is.
I even tried SelectorGadget to try and figure out the attribute, but if I select the first column of the table in question, SelectorGadget shows just "td".
I would get the table like you did and then add the color attribute as a column. The wikitable sortable class works on many pages, so get the first one and remove the second header in row 1.
electionstats <- read_html(url)
x <- html_nodes(electionstats, xpath='//table[@class="wikitable sortable"]')[[1]] %>%
html_table(fill=TRUE)
# paste names from 2nd row header and then remove
names(x)[6:14] <- paste(names(x)[6:14], x[1,6:14])
x <- x[-1,]
The colors are in the first tr/td tags and you can add it to empty column 1 or 3 (see str(x))
names(x)[3] <- "Color"
x$Color <- html_nodes(electionstats, xpath='//table[@class="wikitable sortable"][1]/tr/td[1]') %>%
html_attr("style") %>% gsub("background-color:", "", .)
## drop table footer, extra columns
x <- x[1:83, 2:14]
head(x)
Party Color Alliance Abbr. Candidates No. Candidates +/- Candidates %
2 Bharatiya Janata Party #FF9933 NDA BJP 428 -5 78.82%
3 Indian National Congress #00BFFF UPA INC 464 24 85.45%
4 All India Anna Dravida Munnetra Kazhagam #009900 ADMK 40 17 7.37%
5 All India Trinamool Congress #00FF00 AITC 131 96 24.13%
6 Biju Janata Dal #006400 BJD 21 3 3.87%
7 Shiv Sena #E3882D NDA SHS 24 11 10.68%
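If a style attribute carries more than just the background color, gsub() on the prefix alone will leave the rest behind. A slightly more defensive sketch in base R, assuming style strings like the ones printed in the question (the vector below is typed in by hand, not scraped, and the names styles, has_color and colors are only illustrative):

```r
# Style attribute values as html_attr("style") might return them.
styles <- c("background-color:#FF9933",
            "text-align:left;",
            "background-color:#00BFFF")

# Keep only the cells that actually set a background color.
has_color <- grepl("background-color", styles, fixed = TRUE)

# Capture everything after "background-color:" up to the next semicolon.
colors <- sub(".*background-color:\\s*([^;]+).*", "\\1", styles[has_color])
colors
```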
Looks like your xml_nodeset contains both tr and td nodes.
Deal with both trs and tds, converting to data frames:
library(xml2) # for xml_attrs()
library(dplyr) # for bind_rows()
party_colors_tr <- electionstats %>% html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>% html_nodes('tr')
trs <- bind_rows(lapply(xml_attrs(party_colors_tr), function(x) data.frame(as.list(x), stringsAsFactors=FALSE)))
party_colors_td <- electionstats %>% html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>% html_nodes('tr') %>% html_nodes('td')
tds <- bind_rows(lapply(xml_attrs(party_colors_td), function(x) data.frame(as.list(x), stringsAsFactors=FALSE)))
Write function for extracting styles from data frames:
library(stringi)
list_styles <- function(nodes_frame) {
get_cols <- function(x) { stri_detect_fixed(x, 'background-color') }
has_style <- which(lapply(nodes_frame$style, get_cols) == TRUE)
res <- strsplit(nodes_frame[has_style,]$style, ':')
return(res)
}
Create data frame of extracted styles:
l_trs <- list_styles(trs)
df_trs <- data.frame(do.call('rbind', l_trs)[,1], do.call('rbind', l_trs)[,2])
names(df_trs) <- c('style', 'color')
l_tds <- list_styles(tds)
df_tds <- data.frame(do.call('rbind', l_tds)[,1], do.call('rbind', l_tds)[,2])
names(df_tds) <- c('style', 'color')
Combine trs and tds frames:
final_style_frame <- do.call('rbind', list(df_trs, df_tds))
Here are the first 20 rows:
final_style_frame[1:20,]
