I'm a newbie to Beautiful Soup. Can anyone suggest how to scrape the Excel files for the past 14 days? My understanding is that I should loop over the dates and save each file. Thanks
https://www.hkexnews.hk/reports/sharerepur/sbn.asp
import requests
from bs4 import BeautifulSoup
res=requests.get("https://www.hkexnews.hk/reports/sharerepur/sbn.asp")
soup=BeautifulSoup(res.text,"lxml")
Now we find the table with the find method, use find_all to get all the td tags, and append each report URL to the list lst.
main_data=soup.find("table").find_all("td")
lst=[]
for data in main_data:
    try:
        url = data.find("a").get('href')[1:]
        main_url = "https://www.hkexnews.hk/reports/sharerepur" + url
        lst.append(main_url)
    except AttributeError:
        pass
Now iterate through lst and request each URL, saving the response to an Excel file.
for i, link in enumerate(lst):
    resp = requests.get(link)
    with open(f'test_{i}.xls', 'wb') as output:
        output.write(resp.content)
    print(i)
(Screenshot: the downloaded files being created locally.)
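To restrict the downloads to the past 14 days, one option is to filter the collected URLs by the date embedded in each report filename before downloading. This is a minimal sketch that assumes each filename ends in a YYYYMMDD date just before the .xls extension; check the actual hrefs on the page and adjust the regular expression if the pattern differs.

import re
from datetime import datetime, timedelta

import requests

cutoff = datetime.today() - timedelta(days=14)

recent = []
for link in lst:
    # Assumption: the filename contains a YYYYMMDD date just before the extension
    m = re.search(r'(\d{8})\.xls', link)
    if m and datetime.strptime(m.group(1), '%Y%m%d') >= cutoff:
        recent.append(link)

for i, link in enumerate(recent):
    resp = requests.get(link)
    with open(f'report_{i}.xls', 'wb') as output:
        output.write(resp.content)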
Related
I want to use the GET() function from the httr package, because this is just an example file; for the original file I need to supply a user name and password, i.e.:
library(httr)
filename<-"filename_in_url.xls"
URL <- "originalurl"
GET(URL, authenticate("usr", "pwd"), write_disk(paste0("C:/Temp/temp/",filename), overwrite = TRUE))
As a test, I tried to import one of the files from https://www.nordpoolgroup.com/historical-market-data/ without saving it to disk, loading it into the environment instead so I can see the data. However, it also does not work.
library(XML)
library(RCurl)
excel <- readHTMLTable(htmlTreeParse(getURL(paste("https://www.nordpoolgroup.com/4a4c6b/globalassets/marketdata-excel-files/elspot-prices_2021_hourly_eur.xls")), useInternalNodes=TRUE))[[1]]
Or, if there are other ways to import the data (functions where login information can be passed as an input), it would be great to see them.
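One possibility (a rough sketch, not tested against the Nord Pool site) is to let httr download the response to a temporary file and then read it straight into the session with readxl; authenticate("usr", "pwd") can be added to the GET() call exactly as above. Note that if the file turns out to be an HTML table saved with an .xls extension, read_excel() will fail and you would need to parse it as HTML instead.

library(httr)
library(readxl)

URL <- "https://www.nordpoolgroup.com/4a4c6b/globalassets/marketdata-excel-files/elspot-prices_2021_hourly_eur.xls"

# Download to a temporary file instead of a permanent location on disk
tmp <- tempfile(fileext = ".xls")
resp <- GET(URL, write_disk(tmp, overwrite = TRUE))  # add authenticate("usr", "pwd") here if needed
stop_for_status(resp)

# Read the workbook into the environment as a data frame
excel <- read_excel(tmp)
head(excel)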
I am working on building a job board, which involves scraping job data from company sites. I am currently trying to scrape Twilio at https://www.twilio.com/company/jobs. However, I am not getting the job data itself -- the scraper seems to miss it. Based on other questions, this could be because the data is rendered by JavaScript, but that is not obvious.
Here is the code I am using:
import requests
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'https://www.twilio.com/company/jobs'

# Connect to the URL
response = requests.get(url)
if "_job-title" in response.text:
    print("Found the jobs!")  # FAILS

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

# Loop through all 'a' tags with the job class ('a' tags are for links) and print their hrefs
for one_a_tag in soup.findAll('a', class_='_job'):
    link = one_a_tag['href']
    print(link)  # FAILS
Nothing displays when this code is run. I have tried using urllib2 as well and that has the same problem. Selenium works but it is too slow for the job. Scrapy looks like it could be promising but I am having install issues with it.
Here is a screenshot of the data I am trying to access:
Basic info for all the jobs at the different offices comes back dynamically from an API call you can find in the network tab. If you extract the IDs from that response, you can then make separate requests for the detailed job info using those IDs. Example as shown:
import requests
from bs4 import BeautifulSoup as bs

listings = {}

with requests.Session() as s:
    r = s.get('https://api.greenhouse.io/v1/boards/twilio/offices').json()
    for office in r['offices']:
        for dept in office['departments']:  # you could perform some filtering here or later on
            if 'jobs' in dept:
                for job in dept['jobs']:
                    listings[job['id']] = job  # store basic job info in dict

    for key in listings.keys():
        r = s.get(f'https://boards.greenhouse.io/twilio/jobs/{key}')
        soup = bs(r.content, 'lxml')
        listings[key]['soup'] = soup  # store soup from the detail page
        print(soup.select_one('.app-title').text)  # print something from the page as an example
I am trying to read an Excel file (xlsx) into a data frame in IBM Watson Studio. The Excel file is saved in my list of assets. I'm a bit new to Python.
I have tried creating a project token with some help I got here. I would appreciate it if someone could help with the complete code.
I tried this:
from project_lib import Project
project = Project(project_id='',
                  project_access_token='')
pc = project.project_context

file = project.get_file("xx.xlsx")
file.sheet_names
df = pd.ExcelFile(file)
df = file.parse(0)
df.head()
I need to get the Excel file into a pandas (pd) data frame.
All you need to do is:
First, insert the project token as you already did.
Then simply fetch the file and call .seek(0).
Then read it using pandas' read_excel() and you should be able to read it.
import pandas as pd

# Fetch the file
my_file = project.get_file("tests-example.xls")

# Rewind the buffer and read the Excel file from object storage into a pandas DataFrame
my_file.seek(0)
pd.read_excel(my_file, nrows=10)
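If you also need the sheet names (as in the attempt above), the same buffer can be wrapped in pd.ExcelFile after rewinding it. A minimal sketch, assuming get_file() returns a seekable file-like object as implied above:

my_file.seek(0)
xls = pd.ExcelFile(my_file)
print(xls.sheet_names)              # list the sheets in the workbook
df = xls.parse(xls.sheet_names[0])  # read the first sheet into a DataFrame
df.head()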
For more information: https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/project-lib-python.html
This is my first time working with XML data, and I'd appreciate any help/advice that you can offer!
I'm working on pulling some data that is stored on AWS in a collection of XML files. I have an index file that contains a list of the ~200,000 URLs where the XML files are hosted. I'm currently using the XML package in R to loop through each URL and pull the data from the node that I'm interested in. This is working fine, but with so many URLs, the loop takes around 12 hours to finish.
Here's a simplified version of my code. The index file contains the list of URLs. The parsed XML files aren't very large (stored as dat in this example...R tells me they're 432 bytes). I've put NodeOfInterest in as a placeholder for the spot where I'd normally list the XML tag that I'd like to pull data from.
for (i in 1:200000) {
  url <- paste('http://s3.amazonaws.com/', index[i,9], '_public.xml', sep="")  ## create URL based off of index file
  dat <- xmlTreeParse(url, useInternal = TRUE)                                 ## load entire XML file
  nodes <- getNodeSet(dat, "//x:NodeOfInterest", "x")                          ## find nodes for the tag I'm interested in
  if (length(nodes) > 0 & exists("dat")) {
    dat2 <- xmlToDataFrame(nodes)                 ## create data table from nodes
    compiled_data <- rbind(compiled_data, dat2)   ## append to master file
    rm(dat2)
  }
  print(i)
}
It seems like there must be a more efficient way to pull this data. I think the longest step (by far) is loading the XML into memory, but I haven't found anything out there that suggests another option. Any advice???
Thanks in advance!
If parsing the XML into a tree is your bottleneck (in xmlTreeParse), maybe use a streaming interface like SAX, which allows you to process only those elements that are useful for your application. I haven't used it, but the package xml2 is built on top of libxml2, which provides SAX ability.
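For what it's worth, the XML package the question already uses also ships a SAX-style interface, xmlEventParse(). Below is a rough, untested sketch (assuming the element is literally called NodeOfInterest and ignoring the namespace handling from the original XPath): the branches handler is called once per matching element, so only those subtrees are ever built, and the results are stacked once at the end instead of rbind-ing inside the loop.

library(XML)

rows <- list()

# Called once for every <NodeOfInterest> element; only that subtree is materialised
branch_handler <- list(
  NodeOfInterest = function(node) {
    rows[[length(rows) + 1]] <<- xmlToDataFrame(nodes = list(node))
  }
)

for (i in 1:200000) {
  url <- paste0('http://s3.amazonaws.com/', index[i, 9], '_public.xml')
  xmlEventParse(url, handlers = list(), branches = branch_handler, isURL = TRUE)
}

compiled_data <- do.call(rbind, rows)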
How can I append my R outputs to a single sheet of an xlsx file? I am currently working on web crawling, where I need to scrape user reviews from a website and save them on my desktop in xlsx format. Each time, I need to change the website URL in my code (the user reviews are on different pages) and save the output to one sheet of the xlsx file.
Can you please help me with the code for appending outputs to a single sheet of an xlsx file? Below is the code I am using: every time, I need to change the website URL, run the same code, and save the corresponding output to a single sheet of mydata.xlsx.
library("rvest")
htmlpage <- html("http://www.glassdoor.com/GD/Reviews/Symphony-Teleca-Reviews-E28614_P2.htm?sort.sortType=RD&sort.ascending=false&filter.employmentStatus=REGULAR&filter.employmentStatus=PART_TIME&filter.employmentStatus=UNKNOWN")
proshtml <- html_nodes(htmlpage, ".pros")
pros <- html_text(proshtml)
pros
data=data.frame(pros)
library(xlsx)
write.xlsx(data, "D:/mydata.xlsx", append=TRUE)
A trivial, but super-slow way:
If you only need to add (a few) row(s) to an existing Excel file, and it only has one sheet to which you want to append, you can just do a simple read => overwrite step:
SHEET.NAME <- '...' # fill in with yours
existing.data <- read.xlsx(file, sheetName = SHEET.NAME)
new.data <- rbind(existing.data, data)
write.xlsx(new.data, file, sheetName = SHEET.NAME, row.names = F, append = F)
Note:
It's quite slow in general and will only work at a small scale
read.xlsx is a slow function. Try read.xlsx2 to make it much faster (see the difference in the docs)
If your R process is run once and keeps working for a long time, obviously don't do it this way (reading and overwriting a file is ridiculous in that case)
Look at the package xlsx.
?write.xlsx will show you what you want. append=TRUE is the key.
========= EDIT TO CORRECT =========
As #Jakub pointed out, append=TRUE adds another worksheet to the file.
========= EDIT TO ADD: ANOTHER METHOD ==========
Another method is to save the data to a .csv file, which can easily be opened from Excel. In this case, append=T works as expected (adding to the existing sheet):
write.table(df,"D:/MyFile.csv",append=T,sep=",")
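If the end goal is one sheet containing every page, a third option (a sketch only; the URL pattern below is illustrative and the .pros selector is taken from the question) is to loop over all the page URLs in a single run, stack the results, and write the sheet once at the end, avoiding the append problem entirely:

library(rvest)
library(xlsx)

# Illustrative list of review-page URLs; substitute the real ones, including any query parameters
urls <- paste0("http://www.glassdoor.com/GD/Reviews/Symphony-Teleca-Reviews-E28614_P", 1:5, ".htm")

all_pros <- data.frame()
for (u in urls) {
  page <- read_html(u)
  pros <- html_text(html_nodes(page, ".pros"))
  all_pros <- rbind(all_pros, data.frame(url = u, pros = pros, stringsAsFactors = FALSE))
}

write.xlsx(all_pros, "D:/mydata.xlsx", sheetName = "reviews", row.names = FALSE)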