I'm trying to scrape some data from the following Wikipedia table:
Link: https://en.wikipedia.org/wiki/Aire-la-Ville
I am using the code below to scrape Area, Elevation, and Density with CSS selectors. I am storing the data in canton_table, but only the elevation data comes through, not the other variables.
My code:
# Get labels and data
labels <- current_html %>% html_elements(css = ".infobox-label") %>% html_text()
data <- current_html %>% html_elements(css = ".infobox-data") %>% html_text()
Output for labels and data variables:
> labels
[1] "Country" "Canton" "District"
[4] " • Mayor" " • Total" "Elevation"
[7] " • Total" " • Density" "Time zone"
[10] " • Summer (DST)" "Postal code(s)" "SFOS number"
[13] "Surrounded by" "Website"
> data
[1] "Switzerland"
[2] "Geneva"
[3] "n.a."
[4] "MaireRaymond Gavillet"
[5] "6.50 km2 (2.51 sq mi)"
[6] "428 m (1,404 ft)"
[7] "11,609"
[8] "1,800/km2 (4,600/sq mi)"
[9] "UTC+01:00 (Central European Time)"
[10] "UTC+02:00 (Central European Summer Time)"
[11] "1234,1255"
[12] "6645"
[13] "Bossey (FR-74), Carouge, Chêne-Bougeries, Étrembières (FR-74), Gaillard (FR-74), Geneva (Genève), Plan-les-Ouates, Thônex, Troinex"
[14] "www.veyrier.ch SFSO statistics"
I am only able to populate the table with the elevation data, not area and density. Please help. Thanks!
# Clean text and store in data frame
canton_table[canton_table$name == current_name, "area"] <- helper_function(" • Total", labels, data)
canton_table[canton_table$name == current_name, "elevation"] <- helper_function("Elevation", labels, data)
canton_table[canton_table$name == current_name, "density"] <- helper_function(" • Density", labels, data)
My output table:
You have to change the label names in the labels vector, from " • Total" to "Total", etc.
Names like " • Total" are probably causing the matching problems: the leading bullet and spaces are unlikely to survive a plain string comparison.
Then build the table:
canton_table[canton_table$name == current_name, "area"] <- helper_function("Total", labels, data)
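For reference, since helper_function itself is not shown, here is a hedged sketch of what a robust version could look like: it strips the bullet (U+2022) and surrounding whitespace from the labels before matching, then returns the corresponding data entry.
# hypothetical stand-in for the unshown helper_function:
# clean each label, find the first exact match, return its data entry
helper_function <- function(label, labels, data) {
  clean <- trimws(gsub("\u2022", "", labels))
  idx <- match(label, clean)
  if (is.na(idx)) NA_character_ else trimws(data[idx])
}
helper_function("Total", labels, data)    # "6.50 km2 (2.51 sq mi)"
helper_function("Density", labels, data)  # "1,800/km2 (4,600/sq mi)"
Note that "Total" appears twice in the infobox (area and population), and match() returns the first occurrence, which here is the area row.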
I am using R to try to download images from the Reptile-database by filling in their search form to look for specific images. For that, I am following previous suggestions on filling in an online form from R, such as:
library(httr)
library(tidyverse)
POST(
  url = "http://reptile-database.reptarium.cz/advanced_search",
  encode = "json",
  body = list(
    genus = "Chamaeleo",
    species = "dilepis"
  )
) -> res
out <- content(res)[1]
This seems to work smoothly, but my problem now is to identify the link with the correct species name in the resulting out object.
This object should contain the following page:
https://reptile-database.reptarium.cz/species?genus=Chamaeleo&species=dilepis&search_param=%28%28genus%3D%27Chamaeleo%27%29%28species%3D%27dilepis%27%29%29
This page contains names with links. Thus, I would like to identify the link that takes me to the page with the correct species' table. However, I am unable to find the link, or even the name of the species, within the generated out object.
Here I only extract the links to the pictures. Simply map or apply a function over them to download each one with download.file(); a sketch follows after the output below.
library(tidyverse)
library(rvest)
genus <- "Chamaeleo"
species <- "dilepis"
pics <- paste0(
  "http://reptile-database.reptarium.cz/species?genus=", genus,
  "&species=", species) %>%
  read_html() %>%
  html_elements("#gallery img") %>%
  html_attr("src")
[1] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000034021_01_t.jpg"
[2] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000033342_01_t.jpg"
[3] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029987_01_t.jpg"
[4] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029988_01_t.jpg"
[5] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035130_01_t.jpg"
[6] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035131_01_t.jpg"
[7] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035132_01_t.jpg"
[8] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035133_01_t.jpg"
[9] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036237_01_t.jpg"
[10] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036238_01_t.jpg"
[11] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036239_01_t.jpg"
[12] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041048_01_t.jpg"
[13] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041049_01_t.jpg"
[14] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041050_01_t.jpg"
[15] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041051_01_t.jpg"
[16] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042287_01_t.jpg"
[17] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042288_01_t.jpg"
[18] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0070.jpeg"
[19] "https://calphotos.berkeley.edu/imgs/128x192/1338_3161/0662/0074.jpeg"
[20] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0082.jpeg"
[21] "https://calphotos.berkeley.edu/imgs/128x192/1338_3152/3386/0125.jpeg"
[22] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/1009/0136.jpeg"
[23] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/0210/0057.jpeg"
I've been doing an assignment for a self-study in R programming. I have a question about what happens with the factors in a data frame once you filter it. I have a data frame with the columns (movie) Studio and Genre.
For the assignment I need to filter it. I succeeded in this, but when I check the levels of the newly filtered columns, all the original factor levels are still present, not only the ones left after filtering.
Why is this? Am I doing something wrong?
StudioTarget <- c("Buena Vista Studios","Fox","Paramount Pictures","Sony","Universal","WB")
GenreTarget <- c("action","adventure","animation","comedy","drama")
dftest <- df[df$Studio %in% StudioTarget & df$Genre %in% GenreTarget,]
> levels(dftest$Studio)
[1] "Art House Studios" "Buena Vista Studios" "Colombia Pictures"
[4] "Dimension Films" "Disney" "DreamWorks"
[7] "Fox" "Fox Searchlight Pictures" "Gramercy Pictures"
[10] "IFC" "Lionsgate" "Lionsgate Films"
[13] "Lionsgate/Summit" "MGM" "MiraMax"
[16] "New Line Cinema" "New Market Films" "Orion"
[19] "Pacific Data/DreamWorks" "Paramount Pictures" "Path_ Distribution"
[22] "Relativity Media" "Revolution Studios" "Screen Gems"
[25] "Sony" "Sony Picture Classics" "StudioCanal"
[28] "Summit Entertainment" "TriStar" "UA Entertainment"
[31] "Universal" "USA" "Vestron Pictures"
[34] "WB" "WB/New Line" "Weinstein Company"
You can use droplevels(dftest$Studio) to remove the unused levels.
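For example, droplevels() works both on a single factor and on a whole data frame, where it drops unused levels from every factor column:
dftest$Studio <- droplevels(dftest$Studio)  # one column
dftest <- droplevels(dftest)                # every factor column at once
levels(dftest$Studio)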
No, you're not doing anything wrong. A factor defines a fixed number of levels. These levels remain the same even if one or more of them are not present in the data. You've asked for the levels of your factor, not the values present after filtering.
Consider:
library(tidyverse)
mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  filter(cyl == 4) %>%
  distinct(cyl) %>%
  pull(cyl)
[1] 4
Levels: 4 6 8
Welcome to SO. Next time, please try to provide a minimum working example. This post will help you construct one.
For university research, I am trying to scrape an FDA table (robots.txt allows scraping this content).
The table contains 19 rows and 2 columns:
https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181
The format I am trying to extract is:
col1 col2 url_of_col2
<chr> <chr> <chr>
1 Device Classificati~ distal transcutaneous electrical stimulator for treatm~ https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpcd/classification.cfm?s~
What I achieved:
I can easily extract the items of the first column:
#library
library(tidyverse)
library(xml2)
library(rvest)
#load html
html <- xml2::read_html("https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181")
# select table of interest
html %>%
  html_nodes("table") -> tables
tables[[9]] -> table
# extract col 1 items
table %>%
  html_nodes("th") %>%
  html_text() %>%
  gsub("\n|\t|\r", "", .) %>%
  trimws()
#> [1] "Device Classification Name" "510(k) Number"
#> [3] "Device Name" "Applicant"
#> [5] "Applicant Contact" "Correspondent"
#> [7] "Correspondent Contact" "Regulation Number"
#> [9] "Classification Product Code" "Date Received"
#> [11] "Decision Date" "Decision"
#> [13] "Regulation Medical Specialty" "510k Review Panel"
#> [15] "summary" "Type"
#> [17] "Clinical Trials" "Reviewed by Third Party"
#> [19] "Combination Product"
Created on 2021-02-27 by the reprex package (v1.0.0)
Where I get stuck
Since some cells of column 2 contain a table, this approach does not give the same number of items:
# extract col 2 items
table %>%
  html_nodes("td") %>%
  html_text() %>%
  gsub("\n|\t|\r", "", .) %>%
  trimws()
#> [1] "distal transcutaneous electrical stimulator for treatment of acute migraine"
#> [2] "K203181"
#> [3] "Nerivio, FGD000075-4.7"
#> [4] "Theranica Bioelectronics ltd4 Ha-Omanutst. Poleg Industrial Parknetanya, IL4250574"
#> [5] "Theranica Bioelectronics ltd"
#> [6] "4 Ha-Omanutst. Poleg Industrial Park"
#> [7] "netanya, IL4250574"
#> [8] "alon ironi"
#> [9] "Hogan Lovells US LLP1735 Market StreetSuite 2300philadelphia, PA 19103"
#> [10] "Hogan Lovells US LLP"
#> [11] "1735 Market Street"
#> [12] "Suite 2300"
#> [13] "philadelphia, PA 19103"
#> [14] "janice m. hogan"
#> [15] "882.5899"
#> [16] "QGT "
#> [17] "QGT "
#> [18] "10/26/2020"
#> [19] "01/22/2021"
#> [20] "substantially equivalent (SESE)"
#> [21] "Neurology"
#> [22] "Neurology"
#> [23] "summary"
#> [24] "Traditional"
#> [25] "NCT04089761"
#> [26] "No"
#> [27] "No"
Created on 2021-02-27 by the reprex package (v1.0.0)
Moreover, I could not find a way to extract the URLs of column 2.
I found a good guide on reading HTML tables with cells spanning multiple rows. However, I think that approach does not work for nested tables.
There is a similar question regarding a nested table without links (How to scrape older html with nested tables in R?) which has not been answered yet. A comment there suggested another question; unfortunately, I could not apply it to my HTML table.
There is also the unpivotr package, which aims to read nested HTML tables; however, I could not solve my problem with that package.
Yes, the tables within the rows of the parent table do make this more difficult. The key here is to find the 27 rows of the table and then parse each row individually.
library(rvest)
library(stringr)
library(dplyr)
#load html
html <- xml2::read_html("https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181")
# select table of interest
tables <- html %>% html_nodes("table")
table <- tables[[9]]
# find all of the table's rows
trows <- table %>% html_nodes("tr")
# find the left column
leftside <- trows %>% html_node("th") %>% html_text() %>% trimws()
# find the right column (collapse internal whitespace and trim the ends)
rightside <- trows %>% html_node("td") %>% html_text() %>% str_squish() %>% trimws()
# get the links
links <- trows %>% html_node("td a") %>% html_attr("href")
answer <- data.frame(leftside, rightside, links)
One will need to use paste0("https://www.accessdata.fda.gov", answer$links) on some of the links to obtain the full web address (paste0 rather than paste, which would insert a space into the URL); a guarded sketch follows just below.
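For instance, a sketch of that prefixing step; the assumption that the relative links start with "/" while complete ones start with "http" should be checked against the actual data:
# prefix root-relative links with the FDA host; leave absolute URLs and NAs alone
answer$links <- ifelse(
  !is.na(answer$links) & !grepl("^http", answer$links),
  paste0("https://www.accessdata.fda.gov", answer$links),
  answer$links
)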
The final data frame does have several cells containing NA; these can be removed, and the table can be cleaned up some more depending on the final requirements. See tidyr::fill() as a good starting point.
Update
To reduce the answer down to the desired 19 original rows:
library(tidyr)
# replace NA with blanks
answer$links <- replace_na(answer$links, "")
# fill in the blanks in the first column to allow for grouping
answer <- fill(answer, leftside, .direction = "down")
# create the final result
finalanswer <- answer %>%
  group_by(leftside) %>%
  summarize(info = paste(rightside, collapse = " "), link = first(links))
I have a dataset consisting of zip codes along with latitude and longitude. I want to find the hospitals/banks within 2 km of each latitude/longitude pair.
How can I do this?
The lon/lat data looks like:
store_zip lon lat
410710 73.8248981 18.5154681
410209 73.0907 19.0218215
400034 72.8148177 18.9724162
400001 72.836334 18.9385352
400102 72.834424 19.1418961
400066 72.8635299 19.2313448
400078 72.9327444 19.1570343
400078 72.9327444 19.1570343
400007 72.8133825 18.9618411
400050 72.8299518 19.0551695
400062 72.8426858 19.1593396
400083 72.9374227 19.1166191
400603 72.9781047 19.1834148
401107 72.8929 19.2762702
401105 72.8663173 19.3053477
400703 72.9992013 19.0793547
401209 NA NA
401203 72.7983705 19.4166761
400612 73.0287209 19.1799265
400612 73.0287209 19.1799265
400612 73.0287209 19.1799265
If your Points of Interest are unknown and you need to find them, you can use Google's API through my googleway package (as you've suggested in the comments). You will need a valid API key for this to work.
As the API can only accept one request at a time, you'll need to iterate over your data one row at a time. For that, you can use whatever looping method you're most comfortable with.
library(googleway) ## using v2.4.0 on CRAN
set_key("your_api_key")
lst <- lapply(1:nrow(df), function(x){
  google_places(search_string = "Hospital",
                location = c(df[x, 'lat'], df[x, 'lon']),
                radius = 2000)
})
lst is now a list that contains the results of the queries. For example, the names of the hospitals it returned for the first row of your data are:
place_name(lst[[1]])
# [1] "Jadhav Hospital"
# [2] "Poona Hospital Medical Store"
# [3] "Sanjeevan Hospital"
# [4] "Suyash Hospital"
# [5] "Mehta Hospital"
# [6] "Deenanath Mangeshkar Hospital"
# [7] "Sushrut Hospital"
# [8] "Deenanath Mangeshkar Hospital and Research Centre"
# [9] "MMF Ratna Memorial Hospital"
# [10] "Maharashtra Medical Foundation's Joshi Multispeciality Hospital"
# [11] "Sahyadri Hospitals"
# [12] "Deendayal Memorial Hospital"
# [13] "Jehangir Specialty Hospital"
# [14] "Global Hospital And Research Institute"
# [15] "Prayag Hospital"
# [16] "Apex Superspeciality Hospital"
# [17] "Deoyani Multi Speciality Hospital"
# [18] "Shashwat Hospital"
# [19] "Deccan Multispeciality Hardikar Hospital"
# [20] "City Hospital"
You can also view them on a map
set_key("map_api_key", api = "map")
## the lat/lon of the returned results are found through `place_location()`
# place_location(lst[[1]])
df_hospitals <- place_location(lst[[1]])
df_hospitals$name <- place_name(lst[[1]])
google_map() %>%
  add_circles(data = df[1, ], radius = 2000) %>%
  add_markers(data = df_hospitals, info_window = "name")
Note:
Google's API is limited to 2,500 queries per key per day, unless you pay for a premium account.
I am trying to rename all of these atrocious column names in a data frame I received from a government agency.
> colnames(thedata)
[1] "Region" "Resource Assessment Site ID"
[3] "Site Name/Facility" "Design Head (feet)"
[5] "Design Flow (cfs)" "Installed Capacity (kW)"
[7] "Annual Production (MWh)" "Plant Factor"
[9] "Total Construction Cost (1,000 $)" "Annual O&M Cost (1,000 $)"
[11] "Cost per Installed Capacity ($/kW)" "Benefit Cost Ratio with Green Incentives"
[13] "IRR with Green Incentives" "Benefit Cost Ratio without Green Incentives"
[15] "IRR without Green Incentives"
The column headers have special non-alphanumeric characters and spaces, so referring to them is painful, and I have to rename them. I would like to replace every non-alphanumeric character with a period. This is what I tried:
old.col.names <- colnames(thedata)
new.col.names <- gsub("^a-z0-9", ".", old.col.names)
The ^ is a "not" operator, so I thought this would replace everything that is not alphanumeric with a period in old.col.names.
Can anyone help?
Here are three options to consider. (Note that ^ only negates inside a character class, i.e. [^a-z0-9]; outside brackets it anchors the start of the string, which is why your pattern matched nothing.)
make.names(x)
gsub("[^A-Za-z0-9]", ".", x)
names(janitor::clean_names(setNames(data.frame(matrix(NA, ncol = length(x))), x)))
Here's what each looks like:
make.names(x)
## [1] "Region" "Resource.Assessment.Site.ID"
## [3] "Site.Name.Facility" "Design.Head..feet."
## [5] "Design.Flow..cfs." "Installed.Capacity..kW."
## [7] "Annual.Production..MWh." "Plant.Factor"
## [9] "Total.Construction.Cost..1.000..." "Annual.O.M.Cost..1.000..."
## [11] "Cost.per.Installed.Capacity....kW." "Benefit.Cost.Ratio.with.Green.Incentives"
## [13] "IRR.with.Green.Incentives" "Benefit.Cost.Ratio.without.Green.Incentives"
## [15] "IRR.without.Green.Incentives"
gsub("[^A-Za-z0-9]", ".", x)
## [1] "Region" "Resource.Assessment.Site.ID"
## [3] "Site.Name.Facility" "Design.Head..feet."
## [5] "Design.Flow..cfs." "Installed.Capacity..kW."
## [7] "Annual.Production..MWh." "Plant.Factor"
## [9] "Total.Construction.Cost..1.000..." "Annual.O.M.Cost..1.000..."
## [11] "Cost.per.Installed.Capacity....kW." "Benefit.Cost.Ratio.with.Green.Incentives"
## [13] "IRR.with.Green.Incentives" "Benefit.Cost.Ratio.without.Green.Incentives"
## [15] "IRR.without.Green.Incentives"
library(janitor)
names(clean_names(setNames(data.frame(matrix(NA, ncol = length(x))), x)))
## [1] "region" "resource_assessment_site_id"
## [3] "site_name_facility" "design_head_feet"
## [5] "design_flow_cfs" "installed_capacity_kw"
## [7] "annual_production_mwh" "plant_factor"
## [9] "total_construction_cost_1_000" "annual_o_m_cost_1_000"
## [11] "cost_per_installed_capacity_kw" "benefit_cost_ratio_with_green_incentives"
## [13] "irr_with_green_incentives" "benefit_cost_ratio_without_green_incentives"
## [15] "irr_without_green_incentives"
Sample data:
x <- c("Region", "Resource Assessment Site ID", "Site Name/Facility",
"Design Head (feet)", "Design Flow (cfs)", "Installed Capacity (kW)",
"Annual Production (MWh)", "Plant Factor", "Total Construction Cost (1,000 $)",
"Annual O&M Cost (1,000 $)", "Cost per Installed Capacity ($/kW)",
"Benefit Cost Ratio with Green Incentives", "IRR with Green Incentives",
"Benefit Cost Ratio without Green Incentives", "IRR without Green Incentives")