Does anyone know a web crawler tool for collecting contact details from a website? Say I have www.website/contact and I want to pull out the address, phone number, etc. There are two tools I've been looking at: crawler4j, an open-source Java library, and Scrapy, an open-source Python framework. But I am finding them a bit hard to use for my scenario.
Any suggestions would be great. Thanks
You might search for "simple web crawler" to find a solution that fits you best. There are plenty of pure-Python web crawlers on the net. Starting from their skeleton code, you add the database layer yourself. I think the biggest problem would be setting up the database and saving the data in it.
What if there are 1000000s of websites to crawl? Is there a way to crawl all websites in my area?
That's no problem for a script. Just put the millions of addresses in a file (or files) and open it for reading in Python or another scripting language. Then take the links one by one and crawl/scrape each of them. You might also want to save the results to a file (CSV, JSON).
I'd also recommend a ready-made simple Python crawler.
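As a rough illustration of the file-driven approach described above, here is a minimal sketch using only the Python standard library. The function names (`extract_contacts`, `crawl`, `fetch_url`) and the two regular expressions are my own assumptions, not part of any crawler mentioned in this thread; the patterns are deliberately loose and will need tuning for real phone and address formats.

```python
import csv
import re
from html.parser import HTMLParser
from urllib.request import urlopen

# Assumed, deliberately loose patterns; real-world formats vary widely.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


class TextExtractor(HTMLParser):
    """Collects the visible text of a page, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def extract_contacts(html):
    """Return the unique emails and phone-like strings found in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.parts)
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "phones": sorted(set(m.strip() for m in PHONE_RE.findall(text))),
    }


def fetch_url(url):
    """Download a page and decode it, replacing undecodable bytes."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", "replace")


def crawl(url_file, out_csv, fetch=fetch_url):
    """Read one URL per line from url_file and write the contacts to out_csv."""
    with open(url_file) as src, open(out_csv, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["url", "emails", "phones"])
        for line in src:
            url = line.strip()
            if not url:
                continue
            try:
                found = extract_contacts(fetch(url))
            except Exception as exc:  # keep going when a single site fails
                writer.writerow([url, f"error: {exc}", ""])
                continue
            writer.writerow(
                [url, ";".join(found["emails"]), ";".join(found["phones"])]
            )
```

Calling `crawl("urls.txt", "contacts.csv")` would process the list sequentially. For millions of sites you would probably want concurrency, rate limiting and robots.txt handling, which this sketch leaves out.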
Please assist:
I'm familiar with VBA and C++, but not with Java. I now want to delve into Office Scripts.
However, I want to know if I can achieve the same as in VBA:
I am logging into niche websites and fetching data in tables using VBA Internet Controls (getElementByID()), etc.
As far as I know, these niche websites do not have an API, unlike the one used in the sample web-scraping scenario on the Microsoft website:
https://learn.microsoft.com/en-us/office/dev/scripts/resources/scenarios/noaa-data-fetch
I would like to know if I can log onto these websites, and then fetch information using HTML (getElementByID()) or similar?
I am just unsure whether I can use Office Scripts directly, or whether I need to include a library or something similar.
Any guidance would be appreciated.
Currently, there is no way to do this through Office Scripts alone. The fetch command and REST APIs are the only ways to get data in a script directly from webservices. If you'd like to request the addition of a specific library, please use the Send feedback button in the Office Scripts Code Editor.
The discussion in the comments about using Power Automate is a reasonable path to pursue. The linked video (https://www.youtube.com/watch?v=_O9eEotCT0U) is a good place to start.
I would like to provide a link on my website to download a large file. This should be done with scale in mind. What is the most efficient way to do this today?
Of course I can do it the classic way:
<a href="//download.myserver.com/largefile.zip" title="Download via HTTP" >
The problem with this approach is that I don't want traffic to my server to explode with downloads. So I would rather redirect to external hosting for this large file. What is the best way to host this file then?
If you want to avoid download traffic to your server, then I personally suggest using Azure Blob Storage. There is plenty of documentation, and there are client libraries for .NET. It moves both the download traffic and the security concerns of hosting files off your site and into the Azure cloud, which is very secure to say the least.
If you want the files to be publicly available to anyone, then make a public container, get the URL of the file you want and place it in the anchor tag. Otherwise you may need to familiarise yourself with shared access signatures (SAS), which grant time-limited access to private blobs and are also well documented. Like most things, though, it is not free. The silver lining is that you only pay for what you use.
You can get started here.
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-dotnet
Disclaimer: I do not work for Microsoft, nor do I benefit from this. This is just a personal opinion based on previous experience and projects.
I'm working with some new SCADA software, which uses a browser environment to display everything. One component that the software has is a PDF viewer, however, since we're in a browser environment, it can only load files that are served up over HTTP. According to the forums, this means that the source of the PDF needs to be a URL.
The forum also notes that I can use one of their modules (WebDev) to "stream the PDF bytes over HTTP", and provides directions for how to do so. However, the WebDev module is outside the budget of my project (it's quite a high-powered module, I'd be paying a premium price and then using 1% of its functionality). So I'm wondering if it's possible to serve up a PDF via HTTP some other way.
I'm not an experienced programmer - I'm self taught out of necessity on a small handful of languages, and to a basic level only. As such, I don't fully understand the problem, nor do I know what search terms to use to find the sort of information I need to solve it.
If anyone's able to provide a partial solution, or even just able to help me understand what I'm asking for and where to go looking for some answers, I'd appreciate it!
The PC hosting the PDF files and the SCADA gateway is running Windows 10.
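For context, "serving a PDF over HTTP" only means that some small web server hands the file out under a URL. As a minimal sketch, assuming Python 3.7+ is installed on the Windows 10 PC, the standard library can do this. The `make_server` helper and the `C:\Reports` folder are hypothetical, and I have not confirmed that the SCADA viewer accepts such a URL.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory, host="127.0.0.1", port=8000):
    """Build an HTTP server that serves the files found in `directory`."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer((host, port), handler)


# Example use (blocks until interrupted; the folder name is hypothetical):
#   make_server(r"C:\Reports").serve_forever()
# A file C:\Reports\report.pdf would then be reachable at
#   http://127.0.0.1:8000/report.pdf
```

This server has no authentication and is meant for trusted networks only. Binding to `127.0.0.1` keeps it reachable from the same PC alone, so a different host value would be needed if the browser runs on another machine.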
I had the same issue integrating a PDF report into our SCADA system, which has a web interface and runs Node.js at the backend.
The main point is:
Generate your PDF on the client side (the web interface)
Convert it to a Base64-encoded data URI
Preview it in the DOM or send it to the server!
Send excel and pdf to server side
hope that helps!
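The Base64 conversion step mentioned above can be sketched as follows. The answer did this in the browser; this is a Python version of the encoding step only, and `pdf_to_data_uri` is a name I made up for illustration. Whether a given PDF viewer accepts a data URI in place of an HTTP URL is something to check, not something I can confirm.

```python
import base64


def pdf_to_data_uri(pdf_bytes):
    """Encode raw PDF bytes as a data URI usable as a link or frame source."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return "data:application/pdf;base64," + encoded
```

Base64 makes the payload roughly a third larger than the original file, so this suits small reports better than large documents.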
First, sorry for my English. I'm French, so I don't speak it perfectly; don't be too fierce with me ;)
I'm currently working on a smartphone application developed with Cordova. The application has the notion of an order, and I need to create a way for the user to download a PDF with their order details.
For the project we are using an ASP API, so I think the best way is to ask the API to do it for me. But after some research I haven't found any clue on how to generate a PDF in the API that can be sent to my application. Or perhaps I've misunderstood something, but I'm a little bit stuck currently ^^'
Have a good day!
If I understood your question right, the easiest approach is to generate the PDF file using server-side code and store it on the server. You can then access the PDF file from the server and store it on the device using the Cordova file-transfer and file plugins.
I need to build a website that can be downloaded to a CD.
I'd like to use some CMS (WordPress, Kentico, MojoPortal) to set up my site, and then download it to a CD.
There are many programs that can download a website to a local drive, but how to make the search work is beyond my understanding.
Any ideas?
The project is supposed to be an index of local community services, for communities without a proper internet connection.
If you need to make something that can be viewed from a CD, the best approach is to use only HTML.
WordPress, for example, needs Apache and MySQL to run. And although somebody can "install" the website on his own computer if you supply the content via a CD, most of your users will not be knowledgeable enough to do this task.
Assuming you are just after the content of the site, in general you should be able to find a tool to "crawl" or mirror most sites and create an offline version that can be burned on a CD (for example, using wget).
This will not produce offline versions of application functionality like search or login, so you would need to design your site with those limitations in mind.
For example:
Make sure your site can be fully navigated without JavaScript (most "crawl" tools will discover pages by following links in the html and will have limited or no JavaScript support).
Include some pages which are directory listings of resources on the site (rather than relying on a search).
Possibly implement your search using a client-side technology like JavaScript that would work offline as well.
Use relative html links for images/javascript, and between pages. The tool you use to create the offline version of the site should ideally be able to rewrite/correct internal links for the site, but it would be best to minimise any need to do so.
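To make the client-side search suggestion above more concrete, one option is to precompute a word index from the mirrored HTML files before burning the CD, and let a small script on the pages look words up in it. Here is a minimal sketch of the indexing half in Python. The names `build_index` and `write_index`, the `search-index.js` file name and the `window.SEARCH_INDEX` global are all assumptions of mine, and the word splitting only handles plain ASCII letters and digits.

```python
import json
import re
from html.parser import HTMLParser
from pathlib import Path

WORD_RE = re.compile(r"[a-z0-9]+")


class PageText(HTMLParser):
    """Collects the visible text of a page, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def build_index(site_dir):
    """Map each lowercase word to the sorted list of pages that contain it."""
    root = Path(site_dir)
    index = {}
    for page in root.rglob("*.html"):
        parser = PageText()
        parser.feed(page.read_text(encoding="utf-8", errors="replace"))
        parser.close()
        rel = page.relative_to(root).as_posix()  # relative link, CD-friendly
        for word in set(WORD_RE.findall(" ".join(parser.parts).lower())):
            index.setdefault(word, []).append(rel)
    return {word: sorted(pages) for word, pages in index.items()}


def write_index(site_dir, out_name="search-index.js"):
    """Write the index as a script that defines a global lookup table."""
    out = Path(site_dir) / out_name
    payload = json.dumps(build_index(site_dir), sort_keys=True)
    out.write_text("window.SEARCH_INDEX = " + payload + ";", encoding="utf-8")
    return out
```

The index is written as a `.js` file rather than plain JSON because pages opened directly from a disc are loaded via `file://`, where browsers commonly refuse to fetch data files but will still load a script through a `<script>` tag. The search page would then look up each typed word in `window.SEARCH_INDEX` and list the matching pages.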
Another approach you could consider is distributing using a clientside wiki format, such as TiddlyWiki.
Blurb from the TiddlyWiki site:
TiddlyWiki allows anyone to create personal SelfContained hypertext
documents that can be published to a WebServer, sent by email,
stored in a DropBox or kept on a USB thumb drive to make a WikiOnAStick.
I think you need to clarify what you would like to be downloaded to the CD. As Stennie said, you could download the content and anything else you would need to create the site either with a "crawler" or TiddlyWiki, but otherwise I think what you want to develop is actually an application, in which case you would need to do more development than what standard CMS packages provide. I'm reluctant to, but I would suggest you look into something like the SalesForce platform. It's a cloud-based platform that may facilitate what you're really working towards.
You could create the working CMS on a small web/db server image using VirtualBox and put the virtual disk in a downloadable place. The end user would need the VirtualBox client (free!) and the downloaded virtual disk, but you could configure it to run with minimal effort for the creation, deployment and running phases.