I want to download data from multiple pages of the same website using RStudio.
https://www.irishjobs.ie/ShowResults.aspx?Keywords=Data&autosuggestEndpoint=%2fautosuggest&Location=0&Category=&Recruiter=Company&btnSubmit=Search&Page=2
The only difference between page 2 and page 3 is that the hyperlink ends with a 3 instead of a 2.
I have no problem getting what I need from 25 jobs in 1 page, but I want to get 100 jobs from 4 pages.
I am using the SelectorGadget Chrome extension.
I tried a for loop:
for (page_result in seq(from = 1, to = 101, by = 25)) {
  link = paste0("https://www.irishjobs.ie/ShowResults.aspx?Keywords=Data&autosuggestEndpoint=%2fautosuggest&Location=0&Category=&Recruiter=Company&btnSubmit=Search&Page=2")
  page = read_html(link)
}
I can't figure out how to do it.
I think I need to fit page_result into the link, but I don't know where.
I welcome any ideas.
I have the rvest and dplyr packages loaded, but I want the for loop to go through each page. Any idea how best to do this? Thanks.
The 4 links can easily be put in a for loop. Copy the CSS selector from the DOM and iterate over child positions 5 to 30 to pick up all 25 jobs on each page.
library(rvest)

AllJOBS <- vector()
for (i in 1:4) {
  # build the URL for page i
  url <- paste0("https://www.irishjobs.ie/ShowResults.aspx?Keywords=Data&autosuggestEndpoint=%2fautosuggest&Location=0&Category=&Recruiter=Company&btnSubmit=Search&Page=", i)
  page <- read_html(url)  # fetch each page once, not once per job
  for (k in 5:30) {
    # each job title sits at child position k of the results column
    jobs <- page %>%
      html_node(css = paste0("#page > div.container > div.column-wrap.order-one-two > div.two-thirds > div:nth-child(", k, ") > div > div.job-result-logo-title > div.job-result-title > h2 > a")) %>%
      html_text()
    AllJOBS <- append(AllJOBS, jobs)
    print(k)
  }
  Sys.sleep(runif(1, 1, 2))  # polite random pause between page requests
  print(paste0("Page ", i))
}
Output:
> AllJOBS
[1] "Senior Consultant - Fund Static Data"
[2] "Data Warehouse Engineer"
[3] "Senior Software Engineer - Big Data DevOps"
[4] "HR Data Analyst"
[5] "Data Insights Engineer - Dublin - Permanent/Contract - SQL Server"
[6] NA
[7] "Data Engineer - Master Data Services - SQL Server - Permanent/Contract"
[8] "Senior Data Protection Officer (DPO) - Contract"
[9] "QC Data Analyst (Trending)"
[10] "Senior Data Warehouse Developer"
[11] "Senior Data Analyst FTC"
[12] "Compliance Advisory and Data Protection Relationship Manager"
[13] "Contracts Manager-Data Center"
[14] "Payments Product Data Analyst"
[15] "Data Center Product Hardware Platform Engineer"
[16] "People Data Privacy Program Lead"
[17] "Head of Data Science"
[18] "Data Protection Counsel (Product or Compliance)"
[19] "Data Engineer, GMS"
[20] "Data Protection Associate General Counsel"
[21] "Senior Data Engineer"
[22] "Geospatial Data Scientist"
[23] "Data Solutions Manager"
[24] "Data Protection Solicitor"
[25] "Junior Data Scientist"
[26] "Master Data Specialist"
[27] "Temp QC Electronic Data Management Analyst"
[28] "20725 -Data Scientist - Limerick"
[29] "Technical Support Specialist - Data Centre"
[30] "Lead QC Micro Analyst (data review and compliance)"
[31] "Temp QC Data Analyst"
[32] "#Abbvie Compliance Engineer (Data Integrity)"
[33] "People Data Analyst"
[34] "Senior Electrical Design Engineer - Data Centre Ex"
[35] "Laboratory Data Entry Assistant, UCD NVRL"
[36] "Data Migrations Specialist"
[37] "Data Protection Officer"
[38] "Data Center Operations Engineer (Linux)"
[39] "Senior Electrical Engineer | Data Centre LV Design"
[40] "Data Scientist - (Process Sciences)"
[41] "Mgr Supply Logistics Global Materials Data"
[42] "Data Protection / Privacy Delivery Consultant"
[43] "Global Supply Chain Data Analyst"
[44] "QC Data Analyst"
[45] "0582GradeVIIFOIOLOL1120 - Grade VII Data Protection / Freedom of Information & Compliance Officer"
[46] "DPO001 - Deputy Data Protection Officer (General Manager) Office of the Head of Data Protection, HSE"
[47] "Senior Campaign Data Analyst"
[48] "Data & Reporting Analyst II"
[49] "Azure Data Analytics Solution Architect"
[50] "Head of Risk Assurance for IT, Data, Projects and Outsourcing"
[51] "Trainee Data Technician, Ireland"
[52] NA
You can deal with the NAs separately. Does this answer your question, or did I misinterpret it?
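As an alternative, a class-based selector with html_nodes() can grab all 25 titles on a page in one call, which avoids the NA entries that come from empty nth-child positions. A minimal sketch, assuming the .job-result-title class from the selector above also works on its own:

library(rvest)

AllJOBS <- character()
for (i in 1:4) {
  url <- paste0("https://www.irishjobs.ie/ShowResults.aspx?Keywords=Data&autosuggestEndpoint=%2fautosuggest&Location=0&Category=&Recruiter=Company&btnSubmit=Search&Page=", i)
  # html_nodes() returns every match on the page, so no nth-child counting is needed
  titles <- read_html(url) %>%
    html_nodes(".job-result-title h2 a") %>%  # assumed class-based selector
    html_text()
  AllJOBS <- c(AllJOBS, titles)
  Sys.sleep(runif(1, 1, 2))  # polite pause between page requests
}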
I am trying to expand financial tables on Yahoo Finance with rvest.
url <- "https://finance.yahoo.com/quote/AEFES.IS/balance-sheet?p=AEFES.IS"
# url.session is assumed to be a browsing session created with rvest
url.session <- session(url)
tic.nodes <- url.session %>%
  html_elements(".fi-row") %>%
  html_elements("[title]") %>%
  html_text()
[1] "Total Revenue" "Cost of Revenue"
[3] "Gross Profit" "Operating Expense"
[5] "Operating Income" "Net Non Operating Interest Income Expense"
[7] "Pretax Income" "Tax Provision"
[9] "Net Income Common Stockholders" "Diluted NI Available to Com Stockholders"
[11] "Basic EPS" "Diluted EPS"
[13] "Basic Average Shares" "Diluted Average Shares"
[15] "Total Operating Income as Reported" "Rent Expense Supplemental"
[17] "Total Expenses" "Net Income from Continuing & Discontinued Operation"
[19] "Normalized Income" "Interest Income"
[21] "Interest Expense" "Net Interest Income"
[23] "EBIT" "EBITDA"
[25] "Reconciled Cost of Revenue" "Reconciled Depreciation"
[27] "Net Income from Continuing Operation Net Minority Interest" "Total Unusual Items Excluding Goodwill"
[29] "Total Unusual Items" "Normalized EBITDA"
[31] "Tax Rate for Calcs" "Tax Effect of Unusual Items"
However, the fully expanded table has 47 rows. In the HTML every row starts with fi-row, but my code doesn't pick up the nested sub-rows, which only appear after clicking the page's expand control. Can you help me, please?
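rvest only sees the initial HTML response, so rows that appear after clicking the expand control never reach read_html(). A minimal sketch that clicks the control with RSelenium and then hands the rendered page to rvest; the "Expand All" link text and the wait time are assumptions about the page:

library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "firefox")
remDr <- driver[["client"]]
remDr$navigate("https://finance.yahoo.com/quote/AEFES.IS/balance-sheet?p=AEFES.IS")

# Click the expand toggle; locating it by "Expand All" link text is an assumption
remDr$findElement(using = "link text", "Expand All")$clickElement()
Sys.sleep(2)  # give the nested rows time to render

# Parse the rendered DOM with rvest and pull every fi-row as before
rows <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_elements(".fi-row") %>%
  html_elements("[title]") %>%
  html_text()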
I would like to scrape the keywords inside the dropdown table of this webpage https://www.aeaweb.org/jel/guide/jel.php
The problem is that the drop-down menu of each item prevents me from scraping the table directly because it only takes the heading and not the inner content of each item.
rvest::read_html("https://www.aeaweb.org/jel/guide/jel.php") %>%
rvest::html_table()
I thought of scraping each line that starts with Keywords:, but I don't see how to do that. It seems the HTML does not expose the items inside the table.
An RSelenium solution:
# Start the server and the browser
library(RSelenium)
driver <- rsDriver(browser = "firefox")
remDr <- driver[["client"]]
# Navigate to the url
remDr$navigate("https://www.aeaweb.org/jel/guide/jel.php")
# xpath of the table
out <- remDr$findElement(using = "xpath", '/html/body/main/div/section/div[4]')
# get text from the table
out <- out$getElementText()
out <- out[[1]]
Split using the stringr package:
library(stringr)
str_split(out, "\n", n = Inf, simplify = FALSE)
[[1]]
[1] "A General Economics and Teaching"
[2] "B History of Economic Thought, Methodology, and Heterodox Approaches"
[3] "C Mathematical and Quantitative Methods"
[4] "D Microeconomics"
[5] "E Macroeconomics and Monetary Economics"
[6] "F International Economics"
[7] "G Financial Economics"
[8] "H Public Economics"
[9] "I Health, Education, and Welfare"
[10] "J Labor and Demographic Economics"
[11] "K Law and Economics"
[12] "L Industrial Organization"
[13] "M Business Administration and Business Economics; Marketing; Accounting; Personnel Economics"
[14] "N Economic History"
[15] "O Economic Development, Innovation, Technological Change, and Growth"
[16] "P Economic Systems"
[17] "Q Agricultural and Natural Resource Economics; Environmental and Ecological Economics"
[18] "R Urban, Rural, Regional, Real Estate, and Transportation Economics"
[19] "Y Miscellaneous Categories"
[20] "Z Other Special Topics"
To get the Keywords for "History of Economic Thought, Methodology, and Heterodox Approaches":
out1 <- remDr$findElement(using = 'xpath', value = '//*[@id="cl_B"]')
out1$clickElement()
out1 <- remDr$findElement(using = 'xpath', value = '/html/body/main/div/section/div[4]/div[2]/div[2]/div/div/div/div[2]')
out1$getElementText()
[[1]]
[1] "Keywords: History of Economic Thought"
I have the following data that I obtained from a .txt file using the read_lines function from readr
txtread<-read_lines("expenses_copy1.txt")
txtread
[1] "Amount:Category:Date:Description"
[2] "5.25:supply:20170222:box of staples"
[3] "79.81:meal:20170222:lunch with ABC Corp. clients Al, Bob, and Cy"
[4] "43.00:travel:20170222:cab back to office"
[5] "383.75:travel:20170223:flight to Boston, to visit ABC Corp."
[6] "55.00:travel:20170223:cab to ABC Corp. in Cambridge, MA"
[7] "23.25:meal:20170223:dinner at Logan Airport"
[8] "318.47:supply:20170224:paper, toner, pens, paperclips, tape"
[9] "142.12:meal:20170226:host dinner with ABC clients, Al, Bob, Cy, Dave, Ellie"
[10] "303.94:util:20170227:Peoples Gas"
[11] "121.07:util:20170227:Verizon Wireless"
[12] "7.59:supply:20170227:Python book (used)"
[13] "79.99:supply:20170227:spare 20\" monitor"
[14] "49.86:supply:20170228:Stoch Cal for Finance II"
[15] "6.53:meal:20170302:Dunkin Donuts, drive to Big Inc. near DC"
[16] "127.23:meal:20170302:dinner, Tavern64"
[17] "33.07:meal:20170303:dinner, Uncle Julio's"
[18] "86.00:travel:20170304:mileage, drive to/from Big Inc., Reston, VA"
[19] "22.00:travel:20170304:tolls"
[20] "378.81:travel:20170304:Hyatt Hotel, Reston VA, for Big Inc. meeting"
I want to read each of these into vectors named "Amount", "Category", "Date", and "Description", and create a data frame out of them so that I have a dataset I can work with.
I tried the following:
for (i in length(txtread) ) {
data<-read.table(textConnection(txtread[[i]]))
print(data)
}
However, this doesn't seem to work.
How can I read this data into a data frame in R?
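Since the lines are colon-delimited with a header row, no loop is needed; read.table() can parse the vector you already have. A minimal sketch (quoting is disabled because one description contains a stray " character):

# Parse the colon-delimited lines straight into a data frame
expenses <- read.table(text = txtread, sep = ":", header = TRUE,
                       quote = "", stringsAsFactors = FALSE)

# Or read directly from the file with readr
expenses <- readr::read_delim("expenses_copy1.txt", delim = ":", quote = "")
str(expenses)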
library(rvest)
jobbank <- read_html("https://www.jobbank.gc.ca/LMI_bulletin.do?cid=3373&AREA=0007&INDUSTRYCD=&EVENTCD=")
Error in open.connection(x, "rb") :
Timeout was reached: Connection timed out after 10015 milliseconds
jobbank %>%
html_node(".lmiBox") %>%
html_text()
Error in eval(lhs, parent, parent) : object 'jobbank' not found
I'm trying to find keywords from the news section of the website, but I get these two error messages.
Seems to be working fine on my side. The second error just follows from the first: jobbank was never created because read_html timed out.
library(rvest)
#> Loading required package: xml2
library(stringr)
jobbank <- read_html("https://www.jobbank.gc.ca/LMI_bulletin.do?cid=3373&AREA=0007&INDUSTRYCD=&EVENTCD=")
jobbank %>%
html_node(".lmiBox") %>%
html_text() %>%
str_split("(\r\\n+\\s+)|(\\n\\s+)")
#> [[1]]
#> [1] ""
#> [2] "Week of Jan 14 - Jan 18, 2019Lowe's Canada is looking to hire about 2,650 full-time, part-time and seasonal staff at its stores in Ontario. The company will hold a National Hiring Day on February 23."
#> [3] "The Ministry of Innovation, Science, and Economic Development announced $5M in funding to support automotive innovation at APAG Elektronik Corp. and Service Mold + Aerospace Inc. in Windsor, creating 160 jobs"
#> [4] "A $1M investment by the provincial government into Kenora's Downtown Revitalization Project for a plaza and infrastructure upgrades will create 75 new jobs"
#> [5] "Redfin Corp., an American real estate brokerage, is expanding into Canada and hiring in Toronto"
#> [6] "The construction of townhomes at Walkerville Stones in Windsor is expected to begin this spring "
#> [7] "The Ontario Emerging Jobs Institute (OEJI) at the Nav Centre in Cornwall opened. The OEJI provides skills training in areas with worker shortages."
#> [8] "The Chartwell Meadowbrook Retirement Residence in Lively broke ground on their expansion project, which includes 41 new suites and 14 town homes"
#> [9] "Lambton College created an Information Technology and Communication Research Centre using a five-year, $2M grant from the Natural Sciences and Engineering Research Council of Canada. They hope to use part of the funding to employ students."
#> [10] "SnapCab, a workspace pod manufacturer in Kingston, has grown from 20 to 25 employees with more hiring expected to occur in 2019"
#> [11] "Niagara Pallet & Recyclers Ltd., a manufacturer of pallets and shipping materials in Smithville, is hiring general labour workers, AZ and DZ drivers, production staff, forklift drivers and saw operators"
#> [12] "A1 Demolition will begin demolition of the former Maliboo Club in Simcoe. The plan is to rebuild the structure with residential and commercial space."
#> [13] "MidiCi: The Neapolitan Pizza Co., Sweet Jesus, La Carnita and The Pie Commission will be among several restaurants opening in the 34,000-sq.-ft. Food District in Mississauga this spring "
#> [14] "Menkes Developments Ltd., in partnership with TD Greystone Asset Management, will renovate the former Canada Permanent Trust Building in Toronto. Work on the 270,000-sq.-ft. space is expected to take between 12 and 18 months."
#> [15] "Westmount Signs & Printing in Waterloo is hiring experienced installers after doubling the size of its workforce to 24 employees in the last year and a half"
#> [16] "Microbrewery, Heral Haus Brewing Co. opened in Stratford at the end of December"
#> [17] "Demolition is expected to start this month on Windsor's old City Hall and is expected to be complete by August"
#> [18] "Urban Planet, a clothing store, will open as early as February 2019 at the Cornwall Square mall in Cornwall"
#> [19] "The federal government committed $3.5M towards the construction of a new art gallery in Thunder Bay, bringing total government funding for the project to $27.5M"
#> [20] "The Rec Room, a 44,000-sq.-ft. entertainment complex by Cineplex Entertainment LP, is scheduled to open in Mississauga in March "
#> [21] "Yang Teashop opened a second location in Toronto with plans to open two more locations in the Greater Toronto Area"
#> [22] "Spacecraft Brewery opened in Sudbury"
#> [23] "The Town of Lakeshore will be accepting applications for 11 summer student positions until March 1"
#> [24] "Virtual reality arcade Cntrl V opened in Lindsay"
#> [25] "A new restaurant, Presqu'ile Café and Burger, opened in Brighton"
#> [26] "Beauty brand Morphe LLC opened a store in Mississauga"
#> [27] "Footwear retailer Brown Shoe Company of Canada Ltd. Inc. will open an outlet store in Halton Hills in April"
#> [28] "The Westdale Theatre in Hamilton is scheduled to reopen in February "
#> [29] "Early ON/Family Grouping will open a child care centre in Monkton"
#> [30] "The De Novo addiction treatment centre opened in Huntsville "
#> [31] "French Revolution Bakery & Crêperie opened in Dundas"
#> [32] "A Williams Fresh Cafe is slated to open in Stoney Creek, one of three new locations opening this year in southwestern Ontario"
#> [33] "Monigram Coffee Midtown cafe will open in Kitchener this winter "
#> [34] "My Roti Place opened a fourth restaurant in Toronto"
#> [35] "A Gangster Cheese restaurant opened in Whitby"
#> [36] "A Copper Branch restaurant opened in Mississauga "
#> [37] "Hallmark Canada will exit about 20 company-owned stores across Canada in 2019 by either transitioning them to independent ownership or closing them. The loacations of the affected stores have not been identified."
#> [38] "Lush Cosmetics at the Intercity Shopping Centre in Thunder Bay will close at the end of January"
#> [39] ""
Created on 2019-01-28 by the reprex package (v0.2.1)
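If the timeout keeps happening on your connection, one workaround (a sketch; the 60-second limit is an arbitrary choice) is to fetch the page with httr using a longer timeout and parse the response, since read_html() accepts an httr response object:

library(httr)
library(rvest)

# Allow up to 60 s instead of the ~10 s at which the request gave up
resp <- GET("https://www.jobbank.gc.ca/LMI_bulletin.do?cid=3373&AREA=0007&INDUSTRYCD=&EVENTCD=",
            timeout(60))
jobbank <- read_html(resp)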
I have made so many attempts at this and must now turn to you. I've seen related posts here on SO, but none help. I'm vexed as to why I can't get a list of instruments, which seems to appear on the line following the word Instruments:
library(RCurl); library(XML); library(rvest); library(dplyr); library(stringr)
A<-"https://www.google.com/search?q=lester+young&oq=lester+young&aqs=chrome..69i57j69i60l2j0l3.1767j1j4&sourceid=chrome&ie=UTF-8"
result<-A %>%
read_html()%>%
html_nodes(xpath="//span")%>%html_text()
# Parse `result` with regex
instruments<-str_extract(result,"(.*Instruments:\n.*)")
instruments
dob<-str_extract(result,".*(Born: \n.*)")
dob
'result' looks like this, in part:
[38] "Lester Willis Young, nicknamed \"Pres\" or \"Prez\", was an American jazz tenor saxophonist and occasional clarinetist.\nComing to prominence while a member of Count Basie's orchestra, Young was one of the most influential players on his instrument. Wikipedia"
[39] "Born: "
[40] "August 27, 1909, Woodville, MS"
[41] "Died: "
[42] "March 15, 1959, New York City, NY"
[43] "Nickname: "
[44] "Prez"
[45] "Instruments: "
[46] "Tenor saxophone, clarinet"
While it's possible to use instruments <- result[46] for this webpage, the HTML scraping yields instrument and dob information on different lines for different searches.
Ultimately, I would like to see "Piano" in the instruments object and a date of birth in the dob object.
Thank you...
This worked for me. Get the index of "Instruments:" and then print the next entry. Of course, if the page format changes, this may not work.
> i <- grep("Instruments:", result)  # grep() already returns integer positions
> print(result[i + 1])
[1] "Tenor saxophone, clarinet"
or this:
> result_all <- paste(result,collapse="\n")
> str_extract(result_all,"(Instruments:.*\\n.*)")
[1] "Instruments: \nTenor saxophone, clarinet"