Scraping a data table from clinicaltrials.gov with rvest in R

I'd like to scrape the data table that appears when I enter search terms on clinicaltrials.gov. Specifically, I'd like to scrape the table on this page: https://clinicaltrials.gov/ct2/results?term=nivolumab+AND+Overall+Survival
I've tried this code, but I don't think I have the right CSS selector:
# create custom url
ctgov_url <- "https://clinicaltrials.gov/ct2/results?term=nivolumab+AND+Overall+Survival"

# read HTML page
ct_page <- rvest::read_html(ctgov_url)

# find the element that matches a CSS selector, then parse it as a table
ct_page %>%
  rvest::html_element("t") %>%
  rvest::html_table()

You don't need rvest here at all. The page provides a download button that returns a CSV of the search results. It uses a simple URL-encoded GET syntax, which lets you wrap it in a small function that works like a basic API:
get_clin_trials_data <- function(terms, n = 1000) {
  # combine the terms and encode them for use in a query string
  terms <- URLencode(paste(terms, collapse = " AND "))
  df <- read.csv(paste0(
    "https://clinicaltrials.gov/ct2/results/download_fields",
    "?down_count=", n, "&down_flds=shown&down_fmt=csv",
    "&term=", terms, "&flds=a&flds=b&flds=y"
  ))
  dplyr::as_tibble(df)
}
This allows you to pass in a vector of search terms and a maximum number of results to return, with none of the complex parsing that web scraping would require.
get_clin_trials_data(c("nivolumab", "Overall Survival"), n = 10)
#> # A tibble: 10 x 8
#> Rank Title Status Study.Results Conditions Interventions Locations URL
#> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 A Study ~ Compl~ No Results A~ Hepatocel~ "" "Bristol~ http~
#> 2 2 Nivoluma~ Activ~ No Results A~ Glioblast~ "Drug: Nivol~ "Duke Un~ http~
#> 3 3 Nivoluma~ Unkno~ No Results A~ Melanoma "Biological:~ "CHU d'A~ http~
#> 4 4 Study of~ Compl~ Has Results Advanced ~ "Biological:~ "Highlan~ http~
#> 5 5 A Study ~ Unkno~ No Results A~ Brain Met~ "Drug: Fotem~ "Medical~ http~
#> 6 6 Trial of~ Compl~ Has Results Squamous ~ "Drug: Nivol~ "Stanfor~ http~
#> 7 7 Nivoluma~ Compl~ No Results A~ MGMT-unme~ "Drug: Nivol~ "New Yor~ http~
#> 8 8 Study of~ Compl~ Has Results Squamous ~ "Biological:~ "Mayo Cl~ http~
#> 9 9 Study of~ Compl~ Has Results Non-Squam~ "Biological:~ "Mayo Cl~ http~
#> 10 10 An Open-~ Unkno~ No Results A~ Squamous-~ "Drug: Nivol~ "IRCCS -~ http~
Created on 2022-06-21 by the reprex package (v2.0.1)
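One caveat: URLencode() with its default arguments leaves reserved characters such as & and + untouched. If a search term might contain them, a variant that encodes each term individually with reserved = TRUE is safer. This is a sketch against the same endpoint and parameters as above; the function name is just illustrative:
get_clin_trials_data_safe <- function(terms, n = 1000) {
  # encode each term separately so "&", "+", etc. inside a term are escaped
  terms <- paste(vapply(terms, URLencode, character(1), reserved = TRUE),
                 collapse = "+AND+")
  df <- read.csv(paste0(
    "https://clinicaltrials.gov/ct2/results/download_fields",
    "?down_count=", n, "&down_flds=shown&down_fmt=csv",
    "&term=", terms, "&flds=a&flds=b&flds=y"
  ))
  dplyr::as_tibble(df)
}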

Related

I need help merging rows based on a certain string in the complaint column

I am trying to calculate the fraction of construction noise complaints per zip code across New York City. The data is from NYC 311.
I am using dplyr and have grouped the data by zip code.
However, I am having trouble combining the rows of the complaint column: I need to group the data by whether the string "construction" appears anywhere in the value, whether at the front, middle, or end.
My solution so far (this is just the beginning):
comp_types <- df %>%
  select(complaint_type, descriptor, incident_zip) %>%
  group_by(incident_zip)
Can you help me combine the rows where a unique value in descriptor contains any "construction" value?
Can you clarify what you mean by "merging"? I don't think you actually want to merge, because you only have one data frame; the term "merging" describes the joining of two data frames.
See ?base::merge:
Merge two data frames by common columns or row names, or do other versions of database join operations.
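For example, merge() joins two data frames on a shared key, which isn't your situation; a tiny illustration:
# merge() joins two data frames on a common column - not what's needed here
merge(data.frame(id = 1:2, x = c("a", "b")),
      data.frame(id = 2:3, y = c("c", "d")),
      by = "id")
#>   id x y
#> 1  2 b c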
If I understand correctly, you want to look at the descriptor variable and see if it contains the string "construction" anywhere in the cell, so you can determine whether the complaint was construction-related; the same goes for "music". I don't believe you need complaint_type, since it never contains the string "construction" or "music"; only descriptor does.
You can use a combination of ifelse and grepl to create a new variable that indicates whether the complaint was construction-related, music-related, or other.
library(tidyverse)
library(janitor)

url <- "https://data.cityofnewyork.us/api/views/p5f6-bkga/rows.csv"

df <- read.csv(url, nrows = 10000) %>%
  clean_names() %>%
  select(complaint_type, descriptor, incident_zip)

comp_types <- df %>%
  select(complaint_type, descriptor, incident_zip) %>%
  group_by(incident_zip)
head(comp_types)
#> # A tibble: 6 × 3
#> # Groups: incident_zip [6]
#> complaint_type descriptor incident_zip
#> <chr> <chr> <int>
#> 1 Noise - Residential Banging/Pounding 11364
#> 2 Noise - Residential Loud Music/Party 11222
#> 3 Noise - Residential Banging/Pounding 10033
#> 4 Noise - Residential Loud Music/Party 11208
#> 5 Noise - Residential Loud Music/Party 10037
#> 6 Noise Noise: Construction Before/After Hours (NM1) 11238
table(df$complaint_type)
#>
#> Noise Noise - Commercial Noise - Helicopter
#> 555 591 145
#> Noise - House of Worship Noise - Park Noise - Residential
#> 20 72 5675
#> Noise - Street/Sidewalk Noise - Vehicle
#> 2040 902
df <- df %>%
  mutate(descriptor_misc = ifelse(grepl("Construction", descriptor), "Construction",
                           ifelse(grepl("Music", descriptor), "Music", "Other")))

df %>%
  group_by(descriptor_misc) %>%
  count()
#> # A tibble: 3 × 2
#> # Groups: descriptor_misc [3]
#> descriptor_misc n
#> <chr> <int>
#> 1 Construction 328
#> 2 Music 6354
#> 3 Other 3318
head(df)
#> complaint_type descriptor incident_zip
#> 1 Noise - Residential Banging/Pounding 11364
#> 2 Noise - Residential Loud Music/Party 11222
#> 3 Noise - Residential Banging/Pounding 10033
#> 4 Noise - Residential Loud Music/Party 11208
#> 5 Noise - Residential Loud Music/Party 10037
#> 6 Noise Noise: Construction Before/After Hours (NM1) 11238
#> descriptor_misc
#> 1 Other
#> 2 Music
#> 3 Other
#> 4 Music
#> 5 Music
#> 6 Construction
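As an aside, nested ifelse() calls get hard to read once there are more categories. A dplyr::case_when() version with stringr matching, as a sketch with the same (case-sensitive) logic, would be:
library(dplyr)
library(stringr)

# conditions are evaluated in order, so rows matching both strings
# fall into the first matching category ("Construction")
df <- df %>%
  mutate(descriptor_misc = case_when(
    str_detect(descriptor, "Construction") ~ "Construction",
    str_detect(descriptor, "Music") ~ "Music",
    TRUE ~ "Other"
  ))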

Web Scraping Tables From Wikipedia in R

I was wondering if anyone had useful ideas or code for web scraping tables from Wikipedia.
Specifically, I'm interested in the Presidential election results table in the "Results by county" section on Wikipedia.
An example table can be found using the following link and scrolling down to the "Results by county" section: https://en.wikipedia.org/wiki/1948_United_States_presidential_election_in_Texas
I've tried some solutions from the following StackOverflow post: Importing wikipedia tables in R
However, they don't appear to be applicable to the type of table I want to scrape from Wikipedia.
Any advice, solutions, or code would be greatly appreciated. Thank you!
Making use of the rvest package, you can get the table by first selecting the element containing it via html_element("table.wikitable.sortable") and then extracting it via html_table(), like so:
library(rvest)

url <- "https://en.wikipedia.org/wiki/1948_United_States_presidential_election_in_Texas"
html <- read_html(url)

county_table <- html %>%
  html_element("table.wikitable.sortable") %>%
  html_table()

head(county_table)
#> # A tibble: 6 x 14
#> County `Harry S. Truman… `Harry S. Truman… `Thomas E. Dewey… `Thomas E. Dewe…
#> <chr> <chr> <chr> <chr> <chr>
#> 1 County # % # %
#> 2 Anders… 3,242 62.37% 1,199 23.07%
#> 3 Andrews 816 85.27% 101 10.55%
#> 4 Angeli… 4,377 69.05% 1,000 15.78%
#> 5 Aransas 418 61.02% 235 34.31%
#> 6 Archer 1,599 86.20% 191 10.30%
#> # … with 9 more variables: Strom ThurmondStates’ Rights Democratic <chr>,
#> # Strom ThurmondStates’ Rights Democratic.1 <chr>,
#> # Henry A. WallaceProgressive <chr>, Henry A. WallaceProgressive.1 <chr>,
#> # Various candidatesOther parties <chr>,
#> # Various candidatesOther parties.1 <chr>, Margin <chr>, Margin.1 <chr>,
#> # Total votes cast[11] <chr>
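Note that the first row of the result is a repeated sub-header ("#", "%"), and the numbers still contain commas and percent signs. A quick cleanup sketch, assuming the column layout shown above:
library(dplyr)
library(readr)

county_clean <- county_table %>%
  slice(-1) %>%                          # drop the repeated sub-header row
  mutate(across(-County, parse_number))  # strip commas and "%" signs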

Is rvest the best tool to collect information from this table?

I have used the rvest package to extract a list of companies and the href attributes of the <a> elements for each company, which I need to proceed with the data collection. This is the link to the website: http://www.bursamalaysia.com/market/listed-companies/list-of-companies/main-market.
I have used the following code to extract the table, but nothing comes out. I tried other approaches, such as those posted in "Scraping table of NBA stats with rvest" and similar links, but I cannot obtain what I want. Any help would be greatly appreciated.
My code:
link.main <-
  "http://www.bursamalaysia.com/market/listed-companies/list-of-companies/main-market/"

web <- read_html(link.main) %>%
  html_nodes("table#bm_equities_prices_table")
# it does not work even when I write html_nodes("table"),
# ".table", or "#bm_equities_prices_table"

web <- read_html(link.main) %>%
  html_nodes(".bm_center.bm_dataTable")
# not working either

# to inspect the position of the table on this website
web <- link.main %>% read_html() %>% html_table()
The page generates the table using JavaScript, so plain read_html() never sees it. You need something that actually runs the JavaScript and renders the page, such as RSelenium driving a real browser session.
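A minimal RSelenium sketch, assuming a local browser driver is available (the port and browser choice here are illustrative):
library(RSelenium)
library(rvest)

# start a browser session; rsDriver downloads and launches a driver as needed
rD <- rsDriver(browser = "firefox", port = 4545L)
remDr <- rD$client

remDr$navigate(link.main)  # let the JavaScript render the table
html <- read_html(remDr$getPageSource()[[1]])
remDr$close()

html %>% html_node("table") %>% html_table()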
Another alternative is the excellent decapitated package by @hrbrmstr, which runs a headless Chrome browser session in the background:
# devtools::install_github("hrbrmstr/decapitated")
library(decapitated)
library(rvest)
library(dplyr)

res <- chrome_read_html(link.main)

main_df <- res %>%
  rvest::html_table() %>%
  .[[1]] %>%
  as_tibble()
This outputs the content of the table just fine. If you also want the elements underlying the table (the href attributes behind the table text), you will need a bit more list gymnastics: some cells in the table have no links, so extracting them by CSS selector proved difficult.
library(dplyr)
library(purrr)

href_lst <- res %>%
  html_nodes("table td") %>%
  as_list() %>%
  map("a") %>%
  map(~ attr(.x, "href"))

# we need every third element, starting from the second
idx <- seq.int(from = 2, by = 3, length.out = nrow(main_df))

href_df <- tibble(
  market_href  = as.character(href_lst[idx]),
  company_href = as.character(href_lst[idx + 1])
)

bind_cols(main_df, href_df)
#> # A tibble: 800 x 5
#> No `Company Name` `Company Website` market_href company_href
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 7-ELEVEN MALAYS~ http://www.7elev~ /market/list~ http://www.~
#> 2 2 A-RANK BERHAD [~ http://www.arank~ /market/list~ http://www.~
#> 3 3 ABLEGROUP BERHA~ http://www.gefun~ /market/list~ http://www.~
#> 4 4 ABM FUJIYA BERH~ http://www.abmfu~ /market/list~ http://www.~
#> 5 5 ACME HOLDINGS B~ http://www.suppo~ /market/list~ http://www.~
#> 6 6 ACOUSTECH BERHA~ http://www.acous~ /market/list~ http://www.~
#> 7 7 ADVANCE SYNERGY~ http://www.asb.c~ /market/list~ http://www.~
#> 8 8 ADVANCECON HOLD~ http://www.advan~ /market/list~ http://www.~
#> 9 9 ADVANCED PACKAG~ http://www.advan~ /market/list~ http://www.~
#> 10 10 ADVENTA BERHAD ~ http://www.adven~ /market/list~ http://www.~
#> # ... with 790 more rows
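The market_href values come back relative. If you need absolute URLs, prepending the site root is enough (assuming the links all stay on the same host):
href_df <- href_df %>%
  mutate(market_href = paste0("http://www.bursamalaysia.com", market_href))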
Another option, without using a browser, is to call the endpoint that the page's JavaScript queries and strip off the JSONP callback wrapper:
library(httr)
library(jsonlite)
library(XML)

r <- httr::GET(paste0(
  "http://ws.bursamalaysia.com/market/listed-companies/list-of-companies/list_of_companies_f.html",
  "?_=1532479072277",
  "&callback=jQuery16206432131784246533_1532479071878",
  "&alphabet=",
  "&market=main_market",
  "&_=1532479072277"))

l <- rawToChar(r$content)

# strip the JSONP wrapper: the "jQuery...(" prefix and the trailing ")"
m <- gsub("jQuery16206432131784246533_1532479071878(", "",
          substring(l, 1, nchar(l) - 1), fixed = TRUE)

tbl <- XML::readHTMLTable(jsonlite::fromJSON(m)$html)$bm_equities_prices_table
Output:
> head(tbl)
# No Company Name Company Website
#1 1 7-ELEVEN MALAYSIA HOLDINGS BERHAD http://www.7eleven.com.my
#2 2 A-RANK BERHAD [S] http://www.arank.com.my
#3 3 ABLEGROUP BERHAD [S] http://www.gefung.com.my
#4 4 ABM FUJIYA BERHAD [S] http://www.abmfujiya.com.my
#5 5 ACME HOLDINGS BERHAD [S] http://www.supportivetech.com/
#6 6 ACOUSTECH BERHAD [S] http://www.acoustech.com.my/

Tableau LOD R Equivalent

I'm using a Tableau FIXED LOD function in a report and was looking for ways to mimic this functionality in R.
The data set looks like:
Soldto <- c("123456","122456","123456","122456","124560","125560")
Shipto <- c("123456","122555","122456","124560","122560","122456")
IssueDate <- as.Date(c("2017-01-01","2017-01-02","2017-01-01","2017-01-02","2017-01-01","2017-01-01"))
Method <- c("Ground","Ground","Ground","Air","Ground","Ground")
Delivery <- c("000123","000456","000123","000345","000456","000555")
df1 <- data.frame(Soldto, Shipto, IssueDate, Method, Delivery)
What I'm looking to do is, for each Sold-to/Ship-to/Method combination, count the number of unique delivery IDs.
The intent is to find the number of unique deliveries that could potentially be "aggregated."
In Tableau that function looks like:
{FIXED [Soldto], [Shipto], [IssueDate], [Method] : COUNTD([Delivery])}
Could this be done with aggregate, or with summarise as in the example below?
df.new <- ddply(df1, c("Soldto", "Shipto", "Method"), summarise,
                Deliveries = n_distinct(Delivery))
This is fairly easy with dplyr. You are looking for the number of unique deliveries for each combination of soldto, shipto, and method, which is just group_by() followed by summarise():
library(tidyverse)

tbl <- tibble(
  soldto = c("123456","122456","123456","122456","124560","125560"),
  shipto = c("123456","122555","122456","124560","122560","122456"),
  issuedate = as.Date(c("2017-01-01","2017-01-02","2017-01-01","2017-01-02","2017-01-01","2017-01-01")),
  method = c("Ground","Ground","Ground","Air","Ground","Ground"),
  delivery = c("000123","000456","000123","000345","000456","000555")
)

tbl %>%
  group_by(soldto, shipto, method) %>%
  summarise(uniques = n_distinct(delivery))
#> # A tibble: 6 x 4
#> # Groups: soldto, shipto [?]
#> soldto shipto method uniques
#> <chr> <chr> <chr> <int>
#> 1 122456 122555 Ground 1
#> 2 122456 124560 Air 1
#> 3 123456 122456 Ground 1
#> 4 123456 123456 Ground 1
#> 5 124560 122560 Ground 1
#> 6 125560 122456 Ground 1
Created on 2018-03-02 by the reprex package (v0.2.0).
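To answer the aggregate() part of the question: yes, the same count works in base R. A sketch using the tbl defined above:
# one row per soldto/shipto/method combination, counting unique deliveries
aggregate(delivery ~ soldto + shipto + method, data = tbl,
          FUN = function(x) length(unique(x)))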

Web scraping HLTV player stats with rvest in R

Yes, this is just another "how-to-scrape" question. Sorry for that, but I've read the previous answers and the rvest manual as well.
I'm doing web scraping for my homework (so I do not plan to use the data for any commercial purpose). The idea is to show that the average skill of a team affects individual skill. I'm trying to use CS:GO data from HLTV.org for it.
The information is available at http://www.hltv.org/?pageid=173&playerid=9216
I need two tables: Key stats (data only) and Teammates (data and URLs). I tried to use CSS selectors generated by SelectorGadget, and I also tried to analyze the source code of the web page. I've failed. I'm doing the following:
library(rvest)
library(dplyr)

url <- 'http://www.hltv.org/?pageid=173&playerid=9216'
info <- html_session(url) %>% read_html()

info %>% html_node('.covSmallHeadline') %>% html_text()
Can you please tell me what the right CSS selector is?
If you look at the source, those tables aren't HTML tables, just piles of divs with inconsistent nesting and inline CSS for alignment. It's therefore easiest to grab all the text and fix up the strings afterwards, since each value is either entirely numeric or not numeric at all.
library(rvest)
library(tidyverse)

h <- 'http://www.hltv.org/?pageid=173&playerid=9216' %>% read_html()

h %>%
  html_nodes('.covGroupBoxContent') %>% .[-1] %>%
  html_text(trim = TRUE) %>%
  strsplit('\\s*\\n\\s*') %>%
  setNames(map_chr(., ~.x[1])) %>%
  map(~.x[-1]) %>%
  map(~data_frame(variable = gsub('[.0-9]+', '', .x),
                  value = parse_number(.x)))
#> $`Key stats`
#> # A tibble: 9 × 2
#> variable value
#> <chr> <dbl>
#> 1 Total kills 9199.00
#> 2 Headshot % 46.00
#> 3 Total deaths 6910.00
#> 4 K/D Ratio 1.33
#> 5 Maps played 438.00
#> 6 Rounds played 11242.00
#> 7 Average kills per round 0.82
#> 8 Average deaths per round 0.61
#> 9 Rating (?) 1.21
#>
#> $TeammatesRating
#> # A tibble: 4 × 2
#> variable value
#> <chr> <dbl>
#> 1 Gabriel 'FalleN' Toledo 1.11
#> 2 Fernando 'fer' Alvarenga 1.11
#> 3 Joao 'felps' Vasconcellos 1.09
#> 4 Epitacio 'TACO' de Melo 0.98
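The question also asked for the teammate URLs. Those aren't in the text output, but the <a> elements are still in the underlying divs, so a sketch like this (assuming the links sit inside the same .covGroupBoxContent blocks) pulls the hrefs:
h %>%
  html_nodes('.covGroupBoxContent a') %>%
  html_attr('href')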
