Scraping Wikimedia category trees - R

I want to use R to scrape the links contained within a Wikimedia category tree, along with the structure of the tree, from https://commons.wikimedia.org/wiki/Category:Sports. The code below can expand all the collapsible bullet points:
library(RSelenium)

rD <- rsDriver(check = FALSE)
remDr <- rD[["client"]]
remDr$navigate("https://commons.wikimedia.org/wiki/Category:Sports")

n <- 1
# n <- 10 # takes a long time to expand all bullet points
for (i in 1:n) {
  b <- remDr$findElements(using = "css selector", "[title='expand']")
  for (j in seq_along(b)) {
    b[[j]]$clickElement()
  }
}
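As an aside, here is a rough sketch of how the fixed pass count could be replaced with a loop that keeps clicking until no "expand" toggles remain; it assumes the same remDr session as above and adds a short pause so each subtree has time to load.

for (pass in 1:50) {  # hard upper bound as a guard against an endless loop
  b <- remDr$findElements(using = "css selector", "[title='expand']")
  if (length(b) == 0) break  # nothing left to expand
  for (el in b) el$clickElement()
  Sys.sleep(1)  # give the newly revealed children time to load
}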
... but I am struggling to build a database that would look like...
I can get the bullet hrefs and names using the code below, but I am struggling to find a way to indicate which level each bullet point refers to (i.e. how deep in the category tree each bullet point is). I am thinking there might be a clever XPath method to count how many CategoryTreeChildren deep each bullet is, but that is reaching well beyond my capabilities (a rough sketch of that idea follows the output below).
# for testing I manually expand the bullets for the first couple of branches
# (fully for Bulgaria women badminton, basketball) and the last possible
# branch rather than letting the for loop run through multiple cycles.
library(tidyverse)
library(rvest)

s <- remDr$getPageSource()

d <- read_html(s[[1]]) %>%
  html_nodes("div#mw-subcategories") %>%
  html_nodes("div.CategoryTreeItem") %>%
  html_nodes("a") %>%
  map(xml_attrs) %>%
  map_df(~as.list(.)) %>%
  as_tibble()
# > d
# # A tibble: 135 x 2
# href title
# <chr> <chr>
# 1 /wiki/Category:Categories_by_sport Category:Categories by sport
# 2 /wiki/Category:Categories_by_sport_by_c~ Category:Categories by sport by co~
# 3 /wiki/Category:Categories_of_Bulgaria_b~ Category:Categories of Bulgaria by~
# 4 /wiki/Category:Female_sportspeople_from~ Category:Female sportspeople from ~
# 5 /wiki/Category:Female_badminton_players~ Category:Female badminton players ~
# 6 /wiki/Category:Maria_Delcheva Category:Maria Delcheva
# 7 /wiki/Category:Petya_Nedelcheva Category:Petya Nedelcheva
# 8 /wiki/Category:Gabriela_Stoeva Category:Gabriela Stoeva
# 9 /wiki/Category:Stefani_Stoeva Category:Stefani Stoeva
# 10 /wiki/Category:Women%27s_basketball_pla~ Category:Women's basketball player~
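Here is as far as I have got with the ancestor-counting idea (untested; it assumes every CategoryTreeItem sits inside one CategoryTreeChildren div per level of nesting, so counting those ancestors would give the depth, and that the items line up one-to-one with the rows of d):

# sketch: depth = number of CategoryTreeChildren ancestors of each bullet
items <- read_html(s[[1]]) %>%
  html_nodes("div#mw-subcategories div.CategoryTreeItem")

depth <- map_int(items, ~ length(html_nodes(
  .x, xpath = "ancestor::div[contains(@class, 'CategoryTreeChildren')]")))

d$level <- depth  # assumes items correspond one-to-one to the rows of d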
I have also played around with the WikipediR package - it says in the package description that it can be used to retrieve elements of category trees, but I cannot find an example of how to implement it.

Related

R - Importing and formatting multiple tables from various urls

I am a newbie in R, so there is probably already an answer to the following question, but I haven't found a solution matching the issue I am facing.
I am trying to get tables from a number of webpages; there should be around 5200 of them.
I have imported one in order to format it, but I need to automate the process to get them all.
Here's the url:
http://www.tbca.net.br/base-dados/int_composicao_estatistica.php?cod_produto=C0195C
I have tried to find out a way to get all the tables by doing:
url <- paste0("http://www.tbca.net.br/base-dados/int_composicao_estatistica.php?cod_produto=", ., sep="" )
but I receive an error message saying that
, .,
cannot be read.
In any case, I also don't understand how to automate the formatting process once I have the tables.
Any hint?
Here's how you would do it for one product:
url <- "http://www.tbca.net.br/base-dados/int_composicao_estatistica.php?cod_produto=C0195C"
h <- read_html(url)
tab <- html_table(h, fill=TRUE) %>%
as_tibble(.name_repair = "universal")
tab
# # A tibble: 37 x 1
# ...1$Componente $Unidades $`Valor por 100… $`Desvio padrão` $`Valor Mínimo` $`Valor Máximo` $`Número de dad…
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 Energia kJ 578 - - - -
# 2 Energia kcal 136 - - - -
# 3 Umidade g 65,5 - - - -
# 4 Carboidrato to… g 33,3 - - - -
# 5 Carboidrato di… g 32,5 - - - -
# 6 Proteína g 0,60 - - - -
# 7 Lipídios g 0,26 - - - -
# 8 Fibra alimentar g 0,84 - - - -
# 9 Álcool g 0,00 - - - -
# 10 Cinzas g 0,39 - - - -
# # … with 27 more rows, and 2 more variables: $Referências <chr>, $`Tipo de dados` <chr>
If you wanted to scrape all the codes and get all of the tables, you could do that with the following. First, we set up a loop to scrape all of the links. By inspecting the source, you would find, as you did, that all of the product codes have "cod_produto" in the href attribute. You can use an XPath selector to keep only those a tags containing that string. You are basically looping over every listing page until you reach one that doesn't have any links. This gives you 5203 links.
library(glue)

all_links <- NULL
links <- "init"
i <- 1
while (length(links) > 0) {
  url <- glue("http://www.tbca.net.br/base-dados/composicao_alimentos.php?pagina={i}&atuald=3")
  h <- read_html(url)
  links <- h %>%
    html_nodes(xpath = "//a[contains(@href, 'cod_produto')]") %>%
    html_attr("href") %>%
    unique()
  all_links <- c(all_links, links)
  i <- i + 1
}
EDIT
Next, we can follow each link and pull the table out of it, storing each table in the list called tabs. In answer to the question about how to get the name of the product into the data, there are two easy things to do. The first is to turn the table into a data frame and then add a variable (I called it code) that holds the product code. The second is to set the list names to the product codes. The code below has been edited to do both.
all_links <- unique(all_links)
tabs <- vector(mode = "list", length = length(all_links))

for (i in seq_along(all_links)) {
  url <- glue("http://www.tbca.net.br/base-dados/{all_links[i]}")
  code <- gsub(".*=(.*)$", "\\1", url)  # the product code is everything after the final "="
  h <- read_html(url)
  tmp <- html_table(h, fill = TRUE)[[1]]
  tmp <- as.data.frame(tmp)
  tmp$code <- code
  tabs[[i]] <- tmp
  names(tabs)[i] <- code
}
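If a single long data frame is more convenient than a list, the tables can also be stacked afterwards. This is just a sketch and assumes every product table comes back with the same column layout (which may not hold for every code):

library(dplyr)

# stack the per-product tables; the `code` column added above identifies
# which product each row came from
all_tabs <- bind_rows(tabs)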

Cannot identify html node for scraping in rvest

I'm trying to grab links from a page for subsequent analysis and can only grab about half of them, which may be due to filtering. I'm trying to extract the links highlighted here:
My approach is as follows, which is not ideal because I believe I may be losing some links in the filter() call.
library(rvest)
library(tidyverse)

# initiate session
session <- html_session("https://www.backlisted.fm/episodes")

# collect links for all episodes from the index page:
session %>%
  read_html() %>%
  html_nodes(".underline-body-links a") %>%
  html_attr("href") %>%
  tibble(link_temp = .) %>%
  filter(str_detect(link_temp, pattern = "episodes/")) %>%
  distinct()

# css:
# .underline-body-links #page .html-block a, .underline-body-links #page .product-excerpt a
#result:
link_temp
<chr>
1 /episodes/116-mfk-fisher-how-to-cook-a-wolf
2 https://www.backlisted.fm/episodes/109-barbara-pym-excellent-women
3 /episodes/115-george-amp-weedon-grossmith-the-diary-of-a-nobody
4 https://www.backlisted.fm/episodes/27-jane-gardam-a-long-way-from-verona
5 https://www.backlisted.fm/episodes/5-b-s-johnson-christie-malrys-own-double-entry
6 https://www.backlisted.fm/episodes/97-ray-bradbury-the-illustrated-man
7 /episodes/114-william-golding-the-inheritors
8 https://www.backlisted.fm/episodes/30-georgette-heyer-venetia
9 https://www.backlisted.fm/episodes/49-anita-brookner-look-at-me
10 https://www.backlisted.fm/episodes/71-jrr-tolkien-the-return-of-the-king
# … with 43 more rows
I've been reading multiple documents but I can't target that one type of href. Any help will be much appreciated. Thank you.
Try this
library(rvest)
library(tidyverse)
session <- html_session("https://www.backlisted.fm/index")
raw_html <- read_html(session)
node <- raw_html %>% html_nodes(css = "li p a")
link <- node %>% html_attr("href")
title <- node %>% html_text()
tibble(title, link)
# A tibble: 117 x 2
# title link
# <chr> <chr>
# 1 "A Month in the Country" https://www.backlisted.fm/episodes/1-j-l-carr-a-month-in-the-country
# 2 " - J.L. Carr (with Lissa Evans)" #
# 3 "Good Morning, Midnight - Jean Rhys" https://www.backlisted.fm/episodes/2-jean-rhys-good-morning-midnight
# 4 "It Had to Be You - David Nobbs" https://www.backlisted.fm/episodes/3-david-nobbs-1
# 5 "The Blessing - Nancy Mitford" https://www.backlisted.fm/episodes/4-nancy-mitford-the-blessing
# 6 "Christie Malry's Own Double Entry - B.S. Joh… https://www.backlisted.fm/episodes/5-b-s-johnson-christie-malrys-own-dou…
# 7 "Passing - Nella Larsen" https://www.backlisted.fm/episodes/6-nella-larsen-passing
# 8 "The Great Fire - Shirley Hazzard" https://www.backlisted.fm/episodes/7-shirley-hazzard-the-great-fire
# 9 "Lolly Willowes - Sylvia Townsend Warner" https://www.backlisted.fm/episodes/8-sylvia-townsend-warner-lolly-willow…
# 10 "The Information - Martin Amis" https://www.backlisted.fm/episodes/9-martin-amis-the-information
# … with 107 more rows
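As a small follow-up to the answer above (my addition): some rows come back with an href of "#" (the author lines in the list), so you might drop those and keep only the episode URLs, for example:

library(dplyr)
library(stringr)

tibble(title, link) %>%
  filter(str_detect(link, "/episodes/"))  # drops the "#" placeholder rows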

R - Web Page Scraping - Trouble obtaining Attribute values using rvest

I am trying to use rvest to pull ISO country info from Wikipedia (including links from another page). I can't find a way of correctly obtaining the links (the href attribute) without also including the name (I have tried the XPath string function; it causes an error). It is fairly easy to run and self-explanatory.
Any help appreciated!
library(rvest)
library(dplyr)

searchPage <- read_html("https://en.wikipedia.org/wiki/ISO_3166-2")
nodes <- html_node(searchPage, xpath = '(//h2[(span/@id = "Current_codes")]/following-sibling::table)[1]')

codes <- html_nodes(nodes, xpath = 'tr/td[1]/a/text()')
names <- html_nodes(nodes, xpath = 'tr/td[2]//a[@title]/text()')

# Following brings back data but the attribute name as well
links <- html_nodes(nodes, xpath = 'tr/td[2]//a[@title]/@href')

# Following returns nothing
links2 <- html_nodes(nodes, xpath = 'tr/td[2]//a[@title]/@href/text()')

# Following errors
links3 <- html_nodes(nodes, xpath = 'string(tr/td[2]//a[@title]/@href)')

# Following errors
links4 <- sapply(nodes, function(x) { x %>% read_html() %>% html_nodes("tr/td[2]//a[@title]") %>% html_attr("href") })
You should have included more info in your question. The "self-explanatory" bit nearly made me ignore it (hint: consider providing sufficient verbal detail, out of respect for others' time, rather than just broken code). I say that because I have no idea whether this is what you needed, since you really didn't say.
library(rvest)
library(tibble)

pg <- read_html("https://en.wikipedia.org/wiki/ISO_3166-2")

tab <- html_node(pg, xpath = ".//table[contains(., 'Zimbabwe')]")

iso_col <- html_nodes(tab, xpath = ".//td[1]/a[contains(@href, 'ISO')]")
name_col <- html_nodes(tab, xpath = ".//td[2]")

data_frame(
  iso2c = html_text(iso_col),
  iso2c_link = html_attr(iso_col, "href"),
  country_name = html_text(name_col),
  country_link = html_nodes(name_col, xpath = ".//a[contains(@href, 'wiki')]") %>% html_attr("href")
)
## # A tibble: 249 x 4
## iso2c iso2c_link country_name country_link
## <chr> <chr> <chr> <chr>
## 1 AD /wiki/ISO_3166-2:AD Andorra /wiki/Andorra
## 2 AE /wiki/ISO_3166-2:AE United Arab Emirates /wiki/United_Arab_Emirates
## 3 AF /wiki/ISO_3166-2:AF Afghanistan /wiki/Afghanistan
## 4 AG /wiki/ISO_3166-2:AG Antigua and Barbuda /wiki/Antigua_and_Barbuda
## 5 AI /wiki/ISO_3166-2:AI Anguilla /wiki/Anguilla
## 6 AL /wiki/ISO_3166-2:AL Albania /wiki/Albania
## 7 AM /wiki/ISO_3166-2:AM Armenia /wiki/Armenia
## 8 AO /wiki/ISO_3166-2:AO Angola /wiki/Angola
## 9 AQ /wiki/ISO_3166-2:AQ Antarctica /wiki/Antarctica
## 10 AR /wiki/ISO_3166-2:AR Argentina /wiki/Argentina
## # ... with 239 more rows
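One small follow-up sketch (my addition, not part of the original answer): if full URLs are needed, the relative hrefs can be expanded with xml2::url_absolute(). Here the tibble above is assumed to have been saved in an object called iso.

library(dplyr)
library(xml2)

# expand the relative /wiki/... paths into absolute URLs
iso %>%
  mutate(iso2c_link   = url_absolute(iso2c_link, "https://en.wikipedia.org"),
         country_link = url_absolute(country_link, "https://en.wikipedia.org"))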

How to scrape multiple tables that are without IDs or Class using R

I'm trying to scrape this webpage using R (all the pages): http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all
I'm new to programming, and everywhere I've looked, tables are mostly identified with IDs, divs, or classes. On this page there are none; the data is simply stored in table format. How should I scrape it?
This is what I did:
library(rvest)

webpage <- read_html("http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all")
tbls <- html_nodes(webpage, "table")
head(tbls)

tbls_ls <- webpage %>%
  html_nodes("table") %>%
  .[9:10] %>%
  html_table(fill = TRUE)

colnames(tbls_ls[[1]]) <- c("Mobile Make", "State", "District",
                            "Police Station", "Status", "Mobile Type(GSM/CDMA)",
                            "FIR/DD/GD Dat")
You can scrape the table data by targeting the CSS id of each table. It looks like each page is composed of three different tables pasted one after another. Two of the tables have the #AutoNumber15 CSS id while the third (in the middle) has the #AutoNumber16 CSS id.
Below is a simple code example that should get you started in the right direction.
suppressMessages(library(tidyverse))
suppressMessages(library(rvest))

# define a function to scrape the table data from a page
get_page <- function(page_id = 1) {
  # default link
  link <- "http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all&Page_No="
  # build link
  link <- paste0(link, page_id)
  # get tables data
  wp <- read_html(link)
  wp %>%
    html_nodes("#AutoNumber16, #AutoNumber15") %>%
    html_table(fill = TRUE) %>%
    bind_rows()
}

# get the data from the first three pages
iter_page <- 1:3
# this is just a progress bar
pb <- progress_estimated(length(iter_page))

# this code will iterate over pages 1 through 3 and apply the get_page()
# function defined earlier. The Sys.sleep() part is used to pause the code
# after each iteration so that the server is not overloaded with requests.
map_df(iter_page, ~ {
  pb$tick()$print()
  df <- get_page(.x)
  Sys.sleep(sample(10, 1) * 0.1)
  as_tibble(df)
})
#> # A tibble: 72 x 4
#> X1 X2 X3
#> <chr> <chr> <chr>
#> 1 FIR/DD/GD Number 000165 State
#> 2 FIR/DD/GD Date 17/08/2017 District
#> 3 Mobile Type(GSM/CDMA) GSM Police Station
#> 4 Mobile Make SAMSUNG J2 Mobile Number
#> 5 Missing/Stolen Date 23/04/2017 IMEI Number
#> 6 Complainant AKEEL KHAN Complainant Contact Number
#> 7 Status Stolen/Theft Report Date/Time on ZIPNET
#> 8 <NA> <NA> <NA>
#> 9 FIR/DD/GD Number FIR No 37/ State
#> 10 FIR/DD/GD Date 17/08/2017 District
#> # ... with 62 more rows, and 1 more variables: X4 <chr>
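The tables come back as label/value pairs spread across two column pairs (X1/X2 and X3/X4). As a rough follow-up sketch (my addition, assuming the map_df() result above was saved as raw, that each record starts with a "FIR/DD/GD Number" row, and that each label appears once per record), the rows could be reshaped into one row per missing-phone record:

library(dplyr)
library(tidyr)

tidy_records <- raw %>%
  filter(!is.na(X1)) %>%                                   # drop blank separator rows
  mutate(record = cumsum(X1 == "FIR/DD/GD Number")) %>%    # number each record
  {bind_rows(select(., record, label = X1, value = X2),    # stack both label/value
             select(., record, label = X3, value = X4))} %>%
  filter(!is.na(label), label != "") %>%
  pivot_wider(names_from = label, values_from = value)     # one column per field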

Finding repeated sentences/words/phrases by group over time

I have a data set in which each column is a variable and each row is an observation (like time series data). It looks like this (I apologize for the format, but I can't show the data):
I'd like to know if a person or group is saying the same thing(s) over time. I'm familiar with n-grams, but it's not quite what I need. Any help would be appreciated.
This is the output I'd like:
Sorry for all the edits and poor comments; still getting used to the website.
If you want to see the frequency of each comment for each person and the new column Ready, you can do this with the following code:
library(dplyr)

set.seed(123456)
### I use the same data as the previous example, thank you for providing this!
data <- data.frame(date = Sys.Date() - sample(100),
                   Group = c("Cars", "Trucks") %>% sample(100, replace = TRUE),
                   Reporting_person = c("A", "B", "C") %>% sample(100, replace = TRUE),
                   Comments = c("Awesome", "Meh", "NC") %>% sample(100, replace = TRUE),
                   Ready = as.character(c("Yes", "No") %>% sample(100, replace = TRUE)))

data %>%
  group_by(Reporting_person, Ready) %>%
  count(Comments) %>%
  mutate(prop = prop.table(n))
If what you are asking is whether a change occurs in the comments over time, and whether that change is correlated with an event (like Ready), you can use the following code:
library(dplyr)

### Create a column holding the previous comment (lagged by one observation)
new <- data %>%
  arrange(Reporting_person, Group, date) %>%
  group_by(Group, Reporting_person) %>%
  mutate(comments_plusone = lag(Comments))
new <- na.omit(new)

### Create the change column: 1 is a change, 0 is no change
new$Change <- as.numeric(new$Comments != new$comments_plusone)

### Chi-square test of association between the event (Ready) and the change;
### note that a Pearson correlation is not pertinent here.
tbl <- table(new$Ready, new$Change)
chi2 <- chisq.test(tbl, correct = FALSE)
c(chi2$statistic, chi2$p.value)
sqrt(chi2$statistic / sum(tbl))  # effect size (phi / Cramér's V for a 2x2 table)
You should not find a significant association in this example, as you can see clearly when you plot the table.
plot(tbl)
Note that using the cor function is not appropriate when working with two binary variables.
Here is a post on this topic: Correlation between two binary
Frequency of change by change of state
Following your comments, I am adding this code:
newR <- data %>%
  arrange(Reporting_person, Group, date) %>%
  group_by(Group, Reporting_person) %>%
  mutate(Ready_plusone = lag(Ready))
newR <- na.omit(newR)

### ------------------------ Add the column to the new data frame
### Create the Ready state-change column by pasting the current and lagged
### values; I use this because you seem to have more than 2 levels.
new$State_change <- paste(newR$Ready, newR$Ready_plusone, sep = "_")

### Get the frequency of Change by change of state (Ready Yes_No, No_Yes, ...)
result <- new %>%
  group_by(Reporting_person, State_change) %>%
  count(Change) %>%
  mutate(Frequence = prop.table(n)) %>%
  filter(Change == 1)

### tidyr is a great library for reshaping data; here is the wide format of
### the previous long data frame. Doing this will generate a lot of NAs, so
### the long `result` format may be easier to work with, but this could be
### helpful for future needs.
library(tidyr)
final <- as.data.frame(spread(result, key = State_change, value = Frequence))[, c(1, 4:7)]
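As an aside (my addition, not in the original answer): spread() has since been superseded by pivot_wider(), so an equivalent reshape could also be written, for example, as:

library(dplyr)
library(tidyr)

final2 <- result %>%
  select(Reporting_person, State_change, Frequence) %>%
  pivot_wider(names_from = State_change, values_from = Frequence)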
Hope this helps :)
Something like this?
library(dplyr)

df <- data.frame(date = Sys.Date() - sample(10),
                 Group = c("Cars", "Trucks") %>% sample(10, replace = TRUE),
                 Reporting_person = c("A", "B", "C") %>% sample(10, replace = TRUE),
                 Comments = c("Awesome", "Meh", "NC") %>% sample(10, replace = TRUE))
# date Group Reporting_person Comments
# 1 2017-06-08 Trucks B Awesome
# 2 2017-06-05 Trucks A Awesome
# 3 2017-06-14 Cars B Meh
# 4 2017-06-06 Cars B Awesome
# 5 2017-06-11 Cars A Meh
# 6 2017-06-07 Cars B NC
# 7 2017-06-09 Cars A NC
# 8 2017-06-10 Cars A NC
# 9 2017-06-13 Trucks C Awesome
# 10 2017-06-12 Trucks B NC
aggregate(date ~ ., df, length)
# Group Reporting_person Comments date
# 1 Trucks A Awesome 1
# 2 Cars B Awesome 1
# 3 Trucks B Awesome 1
# 4 Trucks C Awesome 1
# 5 Cars A Meh 1
# 6 Cars B Meh 1
# 7 Cars A NC 2
# 8 Cars B NC 1
# 9 Trucks B NC 1
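For completeness (my addition): the same count can also be written with dplyr, which makes it easy to sort by frequency:

library(dplyr)

df %>%
  count(Group, Reporting_person, Comments, sort = TRUE)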
