Easiest way to scrape webpages and save to .csv

There is a page I want to scrape: you can pass it variables in the URL and it generates specific content. All the content is in one giant HTML table.
I am looking for a way to write a script that can go through 180 of these different pages, extract specific information from certain columns in the table, do some math, and then write the results to a .csv file. That way I can do further analysis on the data myself.
What is the easiest way to scrape webpages, parse the HTML, and then store the data in a .csv file?
I have done similar things in Python and PHP, but parsing HTML is neither the easiest nor the cleanest part of the job. Are there other routes that are easier?

If you have some experience with Python, I would recommend something like BeautifulSoup; in PHP you can use phpQuery.
Once you know how to use the HTML parser, you can build a "pipes-and-filters" program to do the math and dump the results to a .csv file.
Have a look at this question for more info on a Python solution.
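A minimal sketch of that approach, assuming Python with requests and BeautifulSoup 4; the URL pattern, column positions, and the math are all placeholders for whatever the real pages need:

import csv

import requests
from bs4 import BeautifulSoup

# Hypothetical URL pattern; the real page takes its variables in the query string.
BASE_URL = "http://example.com/report?page={}"

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["page", "value_a", "value_b", "ratio"])

    for page in range(1, 181):  # the 180 pages mentioned in the question
        soup = BeautifulSoup(requests.get(BASE_URL.format(page)).text, "html.parser")
        table = soup.find("table")  # the page's one giant table
        for row in table.find_all("tr")[1:]:  # skip the header row
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            if len(cells) < 3:
                continue
            a, b = float(cells[1]), float(cells[2])  # assumed numeric columns
            writer.writerow([page, a, b, a / b if b else ""])  # the "do some math" step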

Related

Looking for a web scraper tool to extract entire tables out of web pages and put them in different sheets in Excel

The first column of this table contains all the links I have to work with: https://www.metabolomicsworkbench.org/data/DRCCStudySummary.php?Mode=StudySummary&SortBy=Analysis&AscDesc=asc&ResultsPerPage=2000
From each of the links I have to download entire tables like this: https://www.metabolomicsworkbench.org/data/show_metabolites_by_study.php?STUDY_ID=ST000886&SORTFIELD=moverz_quant
and put the table from each of the links into a separate sheet in Excel.
I'd highly appreciate it if anyone could tell me how to automate the entire process.
P.S.: I can't code...
ParseHub is a free and powerful web scraper that can pull data from tables.
I have used it in the past by following its step-by-step tutorial.
NO CODING NEEDED.
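For anyone who does later want a scripted route, a rough sketch in Python with pandas (an assumption on my part, not part of the original answer; it needs lxml and openpyxl installed) could pull each study's table into its own Excel sheet:

import pandas as pd

# Two study IDs as a stand-in; in practice, collect them from the first
# column of the DRCCStudySummary.php summary table (ST000887 is hypothetical).
study_ids = ["ST000886", "ST000887"]

base = ("https://www.metabolomicsworkbench.org/data/"
        "show_metabolites_by_study.php?STUDY_ID={}&SORTFIELD=moverz_quant")

with pd.ExcelWriter("metabolites.xlsx") as writer:
    for study_id in study_ids:
        # read_html returns every <table> on the page; assume the first is the one wanted
        table = pd.read_html(base.format(study_id))[0]
        table.to_excel(writer, sheet_name=study_id, index=False)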

Recommendation: Scrape data from a basic HTML table (external URL) into another HTML table

I'd love to get your recommendations for scraping data from a simple HTML table into my own HTML table. There's no need to extract into formats like .csv and so on. When I view my table, I want the external data (from a different URL) to load. What tool would you recommend? It seems pretty straightforward; do you think it is possible to do this with JavaScript included in the HTML file?
Thanks!

Multiple PDF files in one embed

I need your help over a problem I have. Actually, I have a page with a simple embed which displays a PDF file.
I got a request to add another PDF file to the same embed (or at least to do something which would look like it).
I searched for solutions and, not finding a simple one, I'm thinking about using iTextSharp to merge both files (by getting their streams from their URLs) into a new PDF file and displaying the result in the embed.
But I'm telling myself it's a bit much for such a simple modification... So I'm here asking if someone has a better idea? From what I found on Stack Overflow and Google it looks like I will have to take the merge route, but hey, we never know ^^
A simpler option would be to merge the two PDF files using either a free online tool or Adobe's Combine Files option, and then add the newly combined PDF to your site. Unless I am missing something, there is no real reason or benefit to doing this in code.
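That said, if the source PDFs change often enough that re-merging by hand becomes tedious, the scripted merge is small. Here is a sketch in Python with pypdf, a stand-in for the iTextSharp approach from the question and purely an assumption on my part:

from io import BytesIO

import requests
from pypdf import PdfWriter

# Hypothetical URLs; in the question these are the two PDFs' source URLs.
urls = ["http://example.com/first.pdf", "http://example.com/second.pdf"]

writer = PdfWriter()
for url in urls:
    # Fetch each PDF by its stream, as the question describes, and append it.
    writer.append(BytesIO(requests.get(url).content))

# Save the merged document; point the page's embed at this file.
with open("merged.pdf", "wb") as f:
    writer.write(f)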

Scraping a web page & Formatting it

I need some pointers on how to go about solving this problem:
I have more than 10K simple HTML web pages which all have the same format. By "same format", I mean they all have the same h1 tag at the beginning (but with varying text), followed by a table, then a link, and so on. So the basic HTML skeleton of the 10K+ pages is the same; only the text keeps varying.
I have a way to iterate through all those 10K pages. What I do not know is how to copy specific text from each page into an XLS/CSV file column-wise. Once I can achieve this, I will import the spreadsheet into MySQL and do further processing.
I know PHP to a certain extent. So, this is what I can think of:
$html = file_get_contents("http://www.SomeWebsite.com/");
I can then use some regex to pull out the data I need. However, I do not know how to handle redirects.
This is what I can think of, but is there anything better? Maybe an existing tool or a better scripting language?
You can use HTQL to extract the HTML content. It has Python and COM interfaces; see http://htql.net/.
To extract the <h1> tag, simply use "<h1>" as the query.
You could do this with PHP, though I recommend XPath instead of regular expressions.
Personally I use Python with lxml and this webscraping library.
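A short sketch of that lxml route, assuming the requests library (which follows redirects automatically, addressing the concern above); the XPath expressions and output columns are placeholders for the real page structure:

import csv

import requests
from lxml import html

# Hypothetical list of the 10K+ page URLs.
urls = ["http://www.SomeWebsite.com/page1", "http://www.SomeWebsite.com/page2"]

with open("pages.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["heading", "first_cell", "link"])

    for url in urls:
        # requests follows redirects by default, unlike file_get_contents.
        tree = html.fromstring(requests.get(url).text)
        heading = tree.xpath("string(//h1)")               # the varying <h1> text
        first_cell = tree.xpath("string(//table//td[1])")  # first cell of the table
        link = tree.xpath("string(//table/following-sibling::a[1]/@href)")  # the link after the table
        writer.writerow([heading, first_cell, link])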

List of objects to Excel Spreadsheet?

Does anyone know of some code out there that does this already? I have a bunch of pages with data grids on them in an admin website, and they want to export them to Excel. I was hoping someone had this written already; if not, I'll post mine when I am done.
Excel can open files in CSV and XML formats, and it can also generate the schema file.
If you have data grids, then chances are you have DataSets.
There is a method on the DataSet to output it as XML (WriteXml).
You will probably have to use an XmlWriter to save the file.
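Language aside, the CSV route is the simplest shape. A sketch in Python (the original context is .NET, so treat this purely as an illustration of the file format; the row data is made up):

import csv

# Hypothetical stand-in for the grid's underlying rows.
rows = [
    {"id": 1, "name": "Widget", "price": 9.99},
    {"id": 2, "name": "Gadget", "price": 14.50},
]

# A plain .csv file opens directly in Excel, as noted above.
with open("export.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "price"])
    writer.writeheader()
    writer.writerows(rows)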
