I am new to web scraping and have been trying to scrape the list of UK local authorities on the right-hand side, along with the number of Covid-19 cases for each.
Here is the website:
https://www.arcgis.com/apps/opsdashboard/index.html#/f94c3c90da5b4e9f9a0b19484dd4bb14
I have been able to scrape Wikipedia, but I have no idea where to start with the above website. Any tips or links would be very helpful and much appreciated!
I was able to get some numbers from the page with the following code:
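library(rvest)
library(RSelenium)
# Pick a random port above 4444 so repeated runs are unlikely to collide
port <- as.integer(4444L + rpois(lambda = 1000, 1))
# Start a Selenium server with Chrome; chromever must match a driver version available locally
rd <- rsDriver(chromever = "105.0.5195.52", browser = "chrome", port = port)
remDr <- rd$client
remDr$open()
url <- "https://coronavirus.data.gov.uk/"
remDr$navigate(url)
# Take the rendered page source (after JavaScript has run) and parse it with rvest
html_Content <- remDr$getPageSource()[[1]]
text <- read_html(html_Content) %>% html_text2()
# Split the visible text into one element per line
text <- strsplit(text, "\n")[[1]]
text[54 : 72]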
[1] "Last 7 days – first dose"
[2] "10,536Number of people vaccinated (first dose) in the 7 days to 2 October 2022"
[3] "Total – first dose"
[4] "45,275,970Total number of people vaccinated (first dose) reported on 2 October 2022"
[5] "Last 7 days – second dose"
[6] "18,800Number of people vaccinated (second dose) in the 7 days to 2 October 2022"
[7] "Total – second dose"
[8] "42,718,917Total number of people vaccinated (second dose) reported on 2 October 2022"
[9] "Last 7 days – booster or third dose"
[10] "25,518Number of people vaccinated (booster or third dose) in the 7 days to 2 October 2022"
[11] "Total – booster or third dose"
[12] "33,613,297Total number of people vaccinated (booster or third dose) reported on 2 October 2022"
[13] "Percentage of population aged 12+"
[14] "93.6%Percentage of population aged 12+ vaccinated (first dose) reported on 2 October 2022"
[15] "First dose"
[16] "88.3%Percentage of population aged 12+ vaccinated (second dose) reported on 2 October 2022"
[17] "Second dose"
[18] "69.5%Percentage of population aged 12+ vaccinated (booster or third dose) reported on 2 October 2022"
[19] "Booster or third dose"
I hope this is helpful!
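Each value line in the output above fuses the figure with its description (for example "10,536Number of people ..."). As a rough sketch, assuming every such line starts with the number and an optional percent sign, the leading figure can be split off with a regular expression. The names lines, value, number and description are illustrative and not part of the code above.

```r
# Two value lines copied from the output above
lines <- c(
  "10,536Number of people vaccinated (first dose) in the 7 days to 2 October 2022",
  "93.6%Percentage of population aged 12+ vaccinated (first dose) reported on 2 October 2022"
)

# Capture the leading run of digits, commas and dots, plus an optional "%"
value <- sub("^([0-9][0-9,.]*%?).*$", "\\1", lines)

# Drop thousands separators and the percent sign before converting to numeric
number <- as.numeric(gsub("[,%]", "", value))

# Everything after the captured figure is the description
description <- substring(lines, nchar(value) + 1)
```

Lines that do not start with a digit (the headings such as "Total – first dose") are returned unchanged by sub(), so they would need to be filtered out first, for instance with grepl("^[0-9]", lines).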
I'm trying to scrape some data from the infobox table on the following Wikipedia page:
Link: https://en.wikipedia.org/wiki/Aire-la-Ville
I am using this code to scrape area, elevation, and density using CSS selectors. I am storing the data in canton_table, but I only get the elevation data and nothing for the other variables.
My code:
# Get labels and data
labels <- current_html %>% html_elements(css = ".infobox-label") %>% html_text()
data <- current_html %>% html_elements(css = ".infobox-data") %>% html_text()
Output for labels and data variables:
> labels
[1] "Country" "Canton" "District"
[4] " • Mayor" " • Total" "Elevation"
[7] " • Total" " • Density" "Time zone"
[10] " • Summer (DST)" "Postal code(s)" "SFOS number"
[13] "Surrounded by" "Website"
> data
[1] "Switzerland"
[2] "Geneva"
[3] "n.a."
[4] "MaireRaymond Gavillet"
[5] "6.50 km2 (2.51 sq mi)"
[6] "428 m (1,404 ft)"
[7] "11,609"
[8] "1,800/km2 (4,600/sq mi)"
[9] "UTC+01:00 (Central European Time)"
[10] "UTC+02:00 (Central European Summer Time)"
[11] "1234,1255"
[12] "6645"
[13] "Bossey (FR-74), Carouge, Chêne-Bougeries, Étrembières (FR-74), Gaillard (FR-74), Geneva (Genève), Plan-les-Ouates, Thônex, Troinex"
[14] "www.veyrier.ch SFSO statistics"
I am able to populate the table with only elevation data and not area and density. Please help. Thanks!
# Clean text and store in data frame
canton_table[canton_table$name == current_name, "area"] <- helper_function(" • Total", labels, data)
canton_table[canton_table$name == current_name, "elevation"] <- helper_function("Elevation", labels, data)
canton_table[canton_table$name == current_name, "density"] <- helper_function(" • Density", labels, data)
My output table:
You have to change the label names in the labels vector, for example from " • Total" to "Total", and so on.
Names like " • Total" are probably causing the lookup to fail.
Then create the table:
canton_table[canton_table$name == current_name, "area"] <- helper_function("Total", labels, data)
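As a minimal sketch of that clean-up, assuming the labels only need the bullet and the surrounding spaces removed: helper_function is not shown in the question, so the lookup below uses plain indexing instead, and the variable names are illustrative. Note that "Total" occurs twice in the scraped labels (once under area, once under population), so position matters.

```r
# Subset of the scraped vectors from the question ("\u2022" is the bullet character)
labels <- c("Country", " \u2022 Total", "Elevation", " \u2022 Total", " \u2022 Density")
data <- c("Switzerland", "6.50 km2 (2.51 sq mi)", "428 m (1,404 ft)",
          "11,609", "1,800/km2 (4,600/sq mi)")

# Remove the bullet, then trim the surrounding spaces
clean_labels <- trimws(gsub("\u2022", "", labels, fixed = TRUE))

# "Total" appears twice: the first is the area, the second the population
total_idx <- which(clean_labels == "Total")
area <- data[total_idx[1]]
population <- data[total_idx[2]]
elevation <- data[clean_labels == "Elevation"]
density <- data[clean_labels == "Density"]
```

If the page uses non-breaking spaces around the bullet, trimws() with its default settings will not remove them; that case is an assumption worth checking against the real labels.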
I am trying to expand the financial tables on Yahoo Finance with rvest.
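library(rvest)
url <- "https://finance.yahoo.com/quote/AEFES.IS/balance-sheet?p=AEFES.IS"
# url.session is not defined in the original snippet; session(url) is assumed here
url.session <- session(url)
tic.nodes <- url.session %>%
html_elements(".fi-row") %>%
html_elements("[title]") %>%
html_text()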
[1] "Total Revenue" "Cost of Revenue"
[3] "Gross Profit" "Operating Expense"
[5] "Operating Income" "Net Non Operating Interest Income Expense"
[7] "Pretax Income" "Tax Provision"
[9] "Net Income Common Stockholders" "Diluted NI Available to Com Stockholders"
[11] "Basic EPS" "Diluted EPS"
[13] "Basic Average Shares" "Diluted Average Shares"
[15] "Total Operating Income as Reported" "Rent Expense Supplemental"
[17] "Total Expenses" "Net Income from Continuing & Discontinued Operation"
[19] "Normalized Income" "Interest Income"
[21] "Interest Expense" "Net Interest Income"
[23] "EBIT" "EBITDA"
[25] "Reconciled Cost of Revenue" "Reconciled Depreciation"
[27] "Net Income from Continuing Operation Net Minority Interest" "Total Unusual Items Excluding Goodwill"
[29] "Total Unusual Items" "Normalized EBITDA"
[31] "Tax Rate for Calcs" "Tax Effect of Unusual Items"
However, the expanded table has 47 rows. In the HTML, every row has the class fi-row, but my code does not pick up the subdivisions that appear when the income statement is expanded. Can you help me, please?
I used the rvest package in R to scrape some web data, but I am having a lot of trouble getting it into a usable format.
My data currently looks like this:
test
[1] "v. Philadelphia"
[2] "TD GardenRegular Season"
[3] "PTS: 23. Jayson TatumREB: 10. M. MorrisAST: 7. Kyrie Irving"
[4] "PTS: 23. Joel EmbiidREB: 15. Ben SimmonsAST: 8. Ben Simmons"
[5] "100.7 - 83.4"
[6] "# Toronto"
[7] "Air Canada Centre Regular Season"
[8] "PTS: 21. Kyrie IrvingREB: 10. Al HorfordAST: 9. Al Horford"
[9] "PTS: 31. K. LeonardREB: 10. K. LeonardAST: 7. F. VanVleet"
[10] "115.6 - 103.3"
Can someone help me perform the correct operations so that it looks like this (as a data frame)? I would really appreciate it if you could provide the code:
Opponent      Venue
Philadelphia  TD Garden
Toronto       Air Canada Centre
I do not need any of the other information.
Let me know if there are any issues :)
# put your data in here
input <- c("v. Philadelphia", "TD GardenRegular Season",
"", "", "",
"# Toronto", "Air Canada Centre Regular Season",
"", "", "")
index <- 1:length(input)
# raw table format
out_raw <- data.frame(Opponent = input[index%%5==1],
Venue = input[index%%5==2])
# using stringi package
install.packages("stringi")
library(stringi)
# copy and clean up
out_clean <- out_raw
out_clean$Opponent <- stri_extract_last_regex(out_raw$Opponent, "(?<=\\s).*$")
out_clean$Venue <- trimws(gsub("Regular Season", "", out_raw$Venue))
out_clean
I am trying to rename all of these atrocious column names in a data frame I received from a government agency.
> colnames(thedata)
[1] "Region" "Resource Assessment Site ID"
[3] "Site Name/Facility" "Design Head (feet)"
[5] "Design Flow (cfs)" "Installed Capacity (kW)"
[7] "Annual Production (MWh)" "Plant Factor"
[9] "Total Construction Cost (1,000 $)" "Annual O&M Cost (1,000 $)"
[11] "Cost per Installed Capacity ($/kW)" "Benefit Cost Ratio with Green Incentives"
[13] "IRR with Green Incentives" "Benefit Cost Ratio without Green Incentives"
[15] "IRR without Green Incentives"
The column headers contain spaces and special non-alphanumeric characters, which makes referring to them awkward, so I have to rename them. I would like to replace all non-alphanumeric characters with a period. But I tried:
old.col.names <- colnames(thedata)
new.col.names <- gsub("^a-z0-9", ".", old.col.names)
The ^ is a "not" delineation, so I thought it would replace everything that is not alphanumeric with a period in the old.col.names.
Can anyone help?
Here are three options to consider:
make.names(x)
gsub("[^A-Za-z0-9]", ".", x)
names(janitor::clean_names(setNames(data.frame(matrix(NA, ncol = length(x))), x)))
Here's what each looks like:
make.names(x)
## [1] "Region" "Resource.Assessment.Site.ID"
## [3] "Site.Name.Facility" "Design.Head..feet."
## [5] "Design.Flow..cfs." "Installed.Capacity..kW."
## [7] "Annual.Production..MWh." "Plant.Factor"
## [9] "Total.Construction.Cost..1.000..." "Annual.O.M.Cost..1.000..."
## [11] "Cost.per.Installed.Capacity....kW." "Benefit.Cost.Ratio.with.Green.Incentives"
## [13] "IRR.with.Green.Incentives" "Benefit.Cost.Ratio.without.Green.Incentives"
## [15] "IRR.without.Green.Incentives"
gsub("[^A-Za-z0-9]", ".", x)
## [1] "Region" "Resource.Assessment.Site.ID"
## [3] "Site.Name.Facility" "Design.Head..feet."
## [5] "Design.Flow..cfs." "Installed.Capacity..kW."
## [7] "Annual.Production..MWh." "Plant.Factor"
## [9] "Total.Construction.Cost..1.000..." "Annual.O.M.Cost..1.000..."
## [11] "Cost.per.Installed.Capacity....kW." "Benefit.Cost.Ratio.with.Green.Incentives"
## [13] "IRR.with.Green.Incentives" "Benefit.Cost.Ratio.without.Green.Incentives"
## [15] "IRR.without.Green.Incentives"
library(janitor)
names(clean_names(setNames(data.frame(matrix(NA, ncol = length(x))), x)))
## [1] "region" "resource_assessment_site_id"
## [3] "site_name_facility" "design_head_feet"
## [5] "design_flow_cfs" "installed_capacity_kw"
## [7] "annual_production_mwh" "plant_factor"
## [9] "total_construction_cost_1_000" "annual_o_m_cost_1_000"
## [11] "cost_per_installed_capacity_kw" "benefit_cost_ratio_with_green_incentives"
## [13] "irr_with_green_incentives" "benefit_cost_ratio_without_green_incentives"
## [15] "irr_without_green_incentives"
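To see why the attempt in the question leaves the names unchanged: outside square brackets, ^ anchors the match to the start of the string, so "^a-z0-9" looks for the literal text "a-z0-9" at the beginning of each name. Only inside brackets does ^ negate a character class. The short comparison below uses two of the column names; the variable names and the extra variant with + (which collapses runs of punctuation into a single period) are illustrative additions.

```r
x_demo <- c("Design Head (feet)", "Annual O&M Cost (1,000 $)")

# "^" outside brackets is a start-of-string anchor, so nothing matches
unanchored <- gsub("^a-z0-9", ".", x_demo)

# "^" as the first character inside brackets negates the class
negated <- gsub("[^A-Za-z0-9]", ".", x_demo)

# Adding "+" replaces each run of non-alphanumeric characters with one period
collapsed <- gsub("[^A-Za-z0-9]+", ".", x_demo)

print(unanchored)
print(negated)
print(collapsed)
```

Here unanchored is identical to the input, negated gives "Design.Head..feet." and "Annual.O.M.Cost..1.000...", and collapsed gives "Design.Head.feet." and "Annual.O.M.Cost.1.000.".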
Sample data:
x <- c("Region", "Resource Assessment Site ID", "Site Name/Facility",
"Design Head (feet)", "Design Flow (cfs)", "Installed Capacity (kW)",
"Annual Production (MWh)", "Plant Factor", "Total Construction Cost (1,000 $)",
"Annual O&M Cost (1,000 $)", "Cost per Installed Capacity ($/kW)",
"Benefit Cost Ratio with Green Incentives", "IRR with Green Incentives",
"Benefit Cost Ratio without Green Incentives", "IRR without Green Incentives")
I am trying to extract pieces of a string and create new variables from the matched patterns. I have tried numerous functions from the "strings" package and can't seem to get the outcome I want. The example below is made-up data. I want to take a character string, extract the pieces, and store them in new columns of a new data frame.
example
ex <- c("The Accountant (2016)Crime (vodmovies112.blogspot.com.es)","Miss Peregrine's Home for Peculiar Children (2016)FantasySci-Fi (vodmovies112.blogspot.com.es)","Fantastic Beasts And Where To Find Them (2016) TSAdventure (openload.co)","Ben-Hur (2016) HDActionAdventure (vodmovies112.blogspot.com.es)","The Remains (2016) 1080p BlurayHorror (openload.co)" ,"Suicide Squad (2016) HDAction (openload.co)")
> ex
[1] "The Accountant (2016)Crime (vodmovies112.blogspot.com.es)"
[2] "Miss Peregrine's Home for Peculiar Children (2016)FantasySci-Fi (vodmovies112.blogspot.com.es)"
[3] "Fantastic Beasts And Where To Find Them (2016) TSAdventure (openload.co)"
[4] "Ben-Hur (2016) HDActionAdventure (vodmovies112.blogspot.com.es)"
[5] "The Remains (2016) 1080p BlurayHorror (openload.co)"
[6] "Suicide Squad (2016) HDAction (openload.co)"
genres <- c("Action","Adventure","Animation","Biography",
"Comedy","Crime","Documentary","Drama","Family",
"Fantasy","Film-Noir","History","Horror","Music",
"Musical","Mystery","Romance","Sci-Fi","Sport","Thriller",
"War","Western")
genres <- paste0("^",genres,"|")
genres[22] <- "^Western"
> genres
[1] "^Action|" "^Adventure|" "^Animation|" "^Biography|"
[5] "^Comedy|" "^Crime|" "^Documentary|" "^Drama|"
[9] "^Family|" "^Fantasy|" "^Film-Noir|" "^History|"
[13] "^Horror|" "^Music|" "^Musical|" "^Mystery|"
[17] "^Romance|" "^Sci-Fi|" "^Sport|" "^Thriller|"
[21] "^War|" "^Western"
What I am trying to accomplish:
> df
title year domain genre
1 The Accountant 2016 vodmovies112.blogspot.com.es Crime
Here is a possibility:
temp <- strsplit(ex, "\\(|\\)")
df <- setNames(as.data.frame(lapply(1:4,function(i) sapply(temp,"[",i)), stringsAsFactors = FALSE), c("title", "year", "genre", "domain"))
df <- df[ , c("title", "year", "domain", "genre")]
correct <- sapply(seq_along(df$genre), function(y) which(lengths(sapply(seq_along(genres), function(x) grep(genres[x], df$genre[y])))>0))
correct <- lapply(correct, function(x) paste0(genres[x], collapse = " "))
df$genre <- unlist(correct)
df
# title year domain genre
# 1 The Accountant 2016 vodmovies112.blogspot.com.es Crime
# 2 Miss Peregrine's Home for Peculiar Children 2016 vodmovies112.blogspot.com.es Fantasy Sci-Fi
# 3 Fantastic Beasts And Where To Find Them 2016 openload.co Adventure
# 4 Ben-Hur 2016 vodmovies112.blogspot.com.es Action Adventure
# 5 The Remains 2016 openload.co Horror
# 6 Suicide Squad 2016 openload.co Action
Basically, we split each element of the vector ex into 4 parts, delimited by the parentheses. We then create the data.frame df with the 4 columns.
The hardest part is to correctly extract the genre (as there might be more than one genre per movie). I use a combination of sapply, lapply and grep to do it. When that's done, we "correct" the column genre.
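The genre step can also be written more compactly. The sketch below is an alternative, not the code used above: it joins the genre names into one alternation pattern and pulls every match out of the third piece with gregexpr() and regmatches(). It uses three of the example strings and a shortened genre list, and the names ex_demo, genres_demo, parts and found are illustrative.

```r
ex_demo <- c(
  "The Accountant (2016)Crime (vodmovies112.blogspot.com.es)",
  "Miss Peregrine's Home for Peculiar Children (2016)FantasySci-Fi (vodmovies112.blogspot.com.es)",
  "Ben-Hur (2016) HDActionAdventure (vodmovies112.blogspot.com.es)"
)
genres_demo <- c("Action", "Adventure", "Crime", "Fantasy", "Sci-Fi")

# Split on "(" or ")": title, year, raw genre text, domain
parts <- strsplit(ex_demo, "\\(|\\)")
raw_genre <- sapply(parts, `[`, 3)

# One pattern matching any genre, e.g. "Action|Adventure|..."
pattern <- paste(genres_demo, collapse = "|")

# All genre matches per string, pasted back together with spaces
found <- regmatches(raw_genre, gregexpr(pattern, raw_genre))

df_demo <- data.frame(
  title  = trimws(sapply(parts, `[`, 1)),
  year   = sapply(parts, `[`, 2),
  domain = sapply(parts, `[`, 4),
  genre  = sapply(found, paste, collapse = " "),
  stringsAsFactors = FALSE
)
```

Prefixes such as "HD" are ignored simply because they match no genre name. With the full genre list, overlapping names such as "Music" and "Musical" deserve a check on real data.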
Here is your data:
ex <- c("The Accountant (2016)Crime (vodmovies112.blogspot.com.es)",
"Miss Peregrine's Home for Peculiar Children (2016)FantasySci-Fi (vodmovies112.blogspot.com.es)",
"Fantastic Beasts And Where To Find Them (2016) TSAdventure (openload.co)",
"Ben-Hur (2016) HDActionAdventure (vodmovies112.blogspot.com.es)",
"The Remains (2016) 1080p BlurayHorror (openload.co)", "Suicide Squad (2016) HDAction (openload.co)"
)
genres <- c("Action", "Adventure", "Animation", "Biography", "Comedy",
"Crime", "Documentary", "Drama", "Family", "Fantasy", "Film-Noir",
"History", "Horror", "Music", "Musical", "Mystery", "Romance",
"Sci-Fi", "Sport", "Thriller", "War", "Western")
Another possibility using tidyverse:
library(tidyverse)
tibble(x = ex) %>%
extract(
x,
c("title", "year", "genre", "domain"),
"(^[^(]+)\\s+\\((\\d{4})\\)\\s*([^(]+)\\s+\\(([^)]+)"
)
## title year genre domain
## * <chr> <chr> <chr> <chr>
## 1 The Accountant 2016 Crime vodmovies112.blogspot.com.es
## 2 Miss Peregrine's Home for Peculiar Children 2016 FantasySci-Fi vodmovies112.blogspot.com.es
## 3 Fantastic Beasts And Where To Find Them 2016 TSAdventure openload.co
## 4 Ben-Hur 2016 HDActionAdventure vodmovies112.blogspot.com.es
## 5 The Remains 2016 1080p BlurayHorror openload.co
## 6 Suicide Squad 2016 HDAction openload.co