Loop through website links and get PDF's to my computer - web-scraping

This topic is related to Loop through links and download PDF's
I am trying to convert my current VBA code into VBScript. I understand that I have to remove the variable types (the As ... part of Dim statements) and use CreateObject to get those objects, but otherwise everything should port as-is. DoEvents will also have to be replaced with something like WScript.Sleep.
I have run into some problems. Currently, when I run the VBS file, I get an error saying "Object required: 'MSHTML'", pointing to line 65, where I have Set hDoc = MSHTML.HTMLDocument. I searched on Google but found nothing helpful for this one.
How should I proceed with this one?
DownloadFiles("https://www.nordicwater.com/products/waste-water/")
Sub DownloadFiles(p_sURL)
Set xHttp = CreateObject("Microsoft.XMLHTTP")
Dim xHttp
Dim hDoc
Dim Anchors
Dim Anchor
Dim sPath
Dim wholeURL
Dim internet
Dim internetdata
Dim internetlink
Dim internetinnerlink
Dim arrLinks
Dim sLink
Dim iLinkCount
Dim iCounter
Dim sLinks
Set internet = CreateObject("InternetExplorer.Application")
internet.Visible = False
internet.navigate (p_sURL)
Do Until internet.ReadyState = 4
Wscript.Sleep 100
Loop
Set internetdata = internet.document
Set internetlink = internetdata.getElementsByTagName("a")
i = 1
For Each internetinnerlink In internetlink
If Left(internetinnerlink, 36) = "https://www.nordicwater.com/product/" Then
If sLinks <> "" Then sLinks = sLinks & vbCrLf
sLinks = sLinks & internetinnerlink.href
i = i + 1
Else
End If
Next
wholeURL = "https://www.nordicwater.com/"
sPath = "C:\temp\"
arrLinks = Split(sLinks, vbCrLf)
iLinkCount = UBound(arrLinks) + 1
For iCounter = 1 To iLinkCount
sLink = arrLinks(iCounter - 1)
'Get the directory listing
xHttp.Open "GET", sLink
xHttp.send
'Wait for the page to load
Do Until xHttp.ReadyState = 4
Wscript.Sleep 100
Loop
'Put the page in an HTML document
Set hDoc = MSHTML.HTMLDocument
hDoc.body.innerHTML = xHttp.responseText
'Loop through the hyperlinks on the directory listing
Set Anchors = hDoc.getElementsByTagName("a")
For Each Anchor In Anchors
'test the pathname to see if it matches your pattern
If Anchor.pathname Like "*.pdf" Then
xHttp.Open "GET", wholeURL & Anchor.pathname, False
xHttp.send
With CreateObject("Adodb.Stream")
.Type = 1
.Open
.write xHttp.responseBody
.SaveToFile sPath & getName(wholeURL & Anchor.pathname), 2 '//overwrite
End With
End If
Next
Next
End Sub
Function:
Function getName(pf)
getName = Split(pf, "/")(UBound(Split(pf, "/")))
End Function

Instead of Set hDoc = MSHTML.HTMLDocument, use:
Set hDoc = CreateObject("htmlfile")
In VBA/VB6 you can specify variable and object types, but not in VBScript. Instead of declaring objects with a type (for example, Dim objIE As InternetExplorer.Application), you have to use CreateObject (or GetObject) to instantiate objects such as MSHTML.HTMLDocument, Microsoft.XMLHTTP, InternetExplorer.Application, etc.
Another change: VBScript does not support the Like operator, so
If Anchor.pathname Like "*.pdf" Then
has to be rewritten, for example using the StrComp function:
If StrComp(Right(Anchor.pathname, 4), ".pdf", vbTextCompare) = 0 Then
or, more loosely, using the InStr function (this matches ".pdf" anywhere in the path, not only at the end):
If InStr(Anchor.pathname, ".pdf") > 0 Then
Also, at the beginning of your sub, you do the following:
Set xHttp = CreateObject("Microsoft.XMLHTTP")
Dim xHttp
You should declare your variables before assigning values or objects to them. VBScript is very relaxed about this: your code will work because VBScript creates undeclared variables for you, but it is good practice to Dim your variables before using them.
Except for Wscript.sleep commands, your VBScript code will work in VB6/VBA so you can debug your script in VB6 or VBA apps (like Excel).
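The two replacement tests above do not accept exactly the same paths. As a rough, language-neutral illustration (written in Python rather than VBScript, with the hypothetical helper names ends_with_pdf and contains_pdf and made-up sample paths), the suffix test and the substring test can be compared like this:

```python
def ends_with_pdf(pathname: str) -> bool:
    """Rough mirror of StrComp(Right(pathname, 4), ".pdf", vbTextCompare) = 0."""
    # Compare only the last four characters, ignoring case.
    return pathname[-4:].lower() == ".pdf"


def contains_pdf(pathname: str) -> bool:
    """Rough mirror of InStr(pathname, ".pdf") > 0 (case-sensitive substring test)."""
    return ".pdf" in pathname


# Hypothetical sample paths, chosen to show where the two tests disagree.
samples = [
    "/docs/manual.pdf",
    "/docs/MANUAL.PDF",
    "/docs/manual.pdf.html",
    "/docs/page.html",
]
for path in samples:
    print(path, ends_with_pdf(path), contains_pdf(path))
```

In this sketch the suffix test accepts the upper-case extension and rejects manual.pdf.html, while the substring test does the opposite, which is a reason to prefer the StrComp form when only real PDF links should be downloaded.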

Related

Converting HTML to PDF and attaching to email .NET

I am looking to use PDFSharp to convert an HTML page into a PDF, which is then attached to an email and sent, all in one go.
So, I have an aspx page and a VB code file, which gets called through a database SQL job.
Protected Sub Page_Load(sender As Object, e As EventArgs) Handles Me.Load
Dim ReqUrl As String, WorkflowID As String = String.Empty
Using con As New SqlConnection(GlobalVariables.ConStr)
Using com As New SqlCommand("EXEC App.GetWorkflowToSend", con)
con.Open()
Using dr = com.ExecuteReader
Try
While dr.Read
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12
ReqUrl = HttpContext.Current.Request.Url.GetLeftPart(UriPartial.Authority) + dr.Item("WorkflowLink")
WorkflowID = dr.Item("WorkflowID")
Dim r As String = HttpContext.Current.Request.Url.GetLeftPart(UriPartial.Authority) + dr.Item("WorkflowLink")
Dim p As String = Server.MapPath("~\Data\Files") + "\" + WorkflowID + ".pdf"
Dim t As Thread = New Thread(CType(
Function()
ConvertHTML(r, p)
SendMail(Nothing, EmailFrom, "email#address", "New PDF Generated " + WorkflowID, r + "<br/>" + p + "<br/>" + WorkflowID, EmailUser, EmailPass, EmailHost, EmailPort, EmailSSL, "", Nothing, p)
End Function, ThreadStart))
t.SetApartmentState(ApartmentState.STA)
t.Start()
Response.Write(r + "<br>")
Response.Write(p)
End While
Catch
SendMail(Nothing, EmailFrom, "email#address", "Error: " + Err.Description, WorkflowID, EmailUser, EmailPass, EmailHost, EmailPort, EmailSSL, "", Nothing)
End Try
End Using
End Using
End Using
End Sub
In the VB code I essentially call a database stored procedure, which returns some records.
For each of these records, I currently use HttpContext.Current.Request.Url to build a string that is essentially the document URL.
I then also declare, as a String, the location where I want the converted PDF to be stored.
Public Shared Function ConvertHTML(HTMLPage As String, FileName As String) As String
Dim pngfilename As String = Path.GetTempFileName()
Dim res As String = "" ' = ok
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12
'Try
Using wb As System.Windows.Forms.WebBrowser = New System.Windows.Forms.WebBrowser
wb.ScrollBarsEnabled = False
wb.ScriptErrorsSuppressed = True
wb.Navigate(HTMLPage)
While Not (wb.ReadyState = WebBrowserReadyState.Complete)
Application.DoEvents()
End While
wb.Width = wb.Document.Body.ScrollRectangle.Width
wb.Height = wb.Document.Body.ScrollRectangle.Height
If wb.Height > 3000 Then
wb.Height = 3000
End If
' Get a Bitmap representation of the webpage as it's rendered in the WebBrowser control
Dim b As Bitmap = New System.Drawing.Bitmap(wb.Width, wb.Height)
Dim hr As Integer = b.HorizontalResolution
Dim vr As Integer = b.VerticalResolution
wb.DrawToBitmap(b, New Rectangle(0, 0, wb.Width, wb.Height))
wb.Dispose()
If File.Exists(pngfilename) Then
File.Delete(pngfilename)
End If
b.Save(pngfilename, Imaging.ImageFormat.Png)
b.Dispose()
Using doc As PdfSharp.Pdf.PdfDocument = New PdfSharp.Pdf.PdfDocument
Dim page As PdfSharp.Pdf.PdfPage = New PdfSharp.Pdf.PdfPage()
page.Width = PdfSharp.Drawing.XUnit.FromInch(wb.Width / hr)
page.Height = PdfSharp.Drawing.XUnit.FromInch(wb.Height / vr)
doc.Pages.Add(page)
Dim xgr As PdfSharp.Drawing.XGraphics = PdfSharp.Drawing.XGraphics.FromPdfPage(page)
Dim img As PdfSharp.Drawing.XImage = PdfSharp.Drawing.XImage.FromFile(pngfilename)
xgr.DrawImage(img, 0, 0)
doc.Save(FileName)
doc.Close()
img.Dispose()
xgr.Dispose()
End Using
End Using
Return res
End Function
I run the conversion function with these two strings and finally call a mailing function.
PDF Error
The problem I am having at the moment is that the attached PDF I receive in the email doesn't contain the correct output; instead it states 'Navigation to the webpage was cancelled'.
http://127.0.0.1/PDF/TTR/4
C:\inetpub\wwwroot\Prod\Data\Files\4.pdf
I also sent the two strings as output within the email, and they look OK.
I'm sure there is something small missing, whether in my conversion function or in the main code file.

VB.Net OpenXML Conditional Page Break

I am using the OpenXML library to auto-generate Word files. I have a function that takes a group of files and merges them into one document. As I merge each new file into the document, I want it to start on a new page, but I don't want any blank pages. The code I have mostly works, but an issue comes up when a file being merged in is a completely filled page: a page break is still added, which results in an empty page. I am not sure how best to deal with this and prevent blank pages from being added. Here is my code:
Public Sub MergeFiles(ByVal filePaths As List(Of String), ByVal fileName As String)
Dim newFile As String = HttpRuntime.AppDomainAppPath & "PDF_Templates\TempFolder\catalog-" & Guid.NewGuid.ToString & ".docx"
File.Copy(fileName, newFile)
Dim counter As Integer = 0
For Each filePath As String In filePaths
Dim wordDoc As WordprocessingDocument = WordprocessingDocument.Open(newFile, True)
Using wordDoc
Dim mainPart As MainDocumentPart = wordDoc.MainDocumentPart
Dim altChunkId As String = "altChunkId" & counter
Dim chunk As AlternativeFormatImportPart = mainPart.AddAlternativeFormatImportPart(AlternativeFormatImportPartType.WordprocessingML, altChunkId)
Dim fileStream As FileStream = File.Open(filePath, FileMode.Open)
Using fileStream
chunk.FeedData(fileStream)
End Using
Dim AltChunk As AltChunk = New AltChunk()
AltChunk.Id = altChunkId
' Dont add a page break to the first page.
If counter > 0 Then
Dim last As OpenXmlElement = wordDoc.MainDocumentPart.Document.Body.Elements().LastOrDefault(Function(e) TypeOf e Is Paragraph OrElse TypeOf e Is AltChunk)
last.InsertAfterSelf(New Paragraph(New Run(New Break() With {
.Type = BreakValues.Page
})))
End If
mainPart.Document.Body.InsertAfter(Of AltChunk)(AltChunk, mainPart.Document.Body.Elements(Of Paragraph).Last())
mainPart.Document.Save()
wordDoc.Close()
End Using
counter = counter + 1
Next
End Sub

Proceed to each link, find file type and download

I am wondering whether there is any solution for downloading files from a website with VBScript.
I know how to download a single file from a website, but how can I turn that into a loop? Also, how can I search a particular page for a certain file extension and download the file(s) if available?
For each pdf in website
xhr.open "GET", pdf.src, False
xhr.send
set stream = CreateObject("Adodb.Stream")
with stream
.type = 1
.Open
.Write xhr.responsebody
.SaveToFile "C:\temp\" + CStr(index) + ".pdf", 2
end with
stream.Close
set stream = nothing
index = index + 1
Next
Let's say we have a website https://website.com/productpage/ and on it there are links that all have the same structure, https://website.com/products/xx-x-xx-x/, so all the needed links start with https://website.com/products/. There seem to be 33 links of that kind according to the source code.
Each of those pages then contains PDF files: sometimes one, sometimes 3 or 4. The link to a PDF file is something like https://website.com/wp-content/uploads/2016/12/xxxx.pdf, where xxxx.pdf can be any filename.
Here is what I have managed to get for one file:
dim xHttp: Set xHttp = createobject("Microsoft.XMLHTTP")
dim bStrm: Set bStrm = createobject("Adodb.Stream")
xHttp.Open "GET", "https://website.com/wp-content/uploads/2016/12/xxxx.pdf", False
xHttp.Send
with bStrm
.type = 1 '//binary
.open
.write xHttp.responseBody
.savetofile "c:\temp\xxxx.pdf", 2 '//overwrite
end with
EDIT:
Should it go like this?
1. Get all the needed links
2. Proceed to each link
3. Search for links ending with ".pdf"
4. Download the files to C:\temp\
Structure of website:
https://website.com/productpage/
    https://website.com/products/xx-x/
        https://website.com/wp-content/uploads/2016/12/xx-xx.pdf
    https://website.com/products/xxxxx-xsx/
        https://website.com/wp-content/uploads/2018/12/x-xx-x.pdf
        https://website.com/wp-content/uploads/2015/12/x-x-xx.pdf
        https://website.com/wp-content/uploads/2019/12/xxx-x.pdf
    https://website.com/products/x-xx-xsx/
        https://website.com/wp-content/uploads/2014/12/x-xxx.pdf
        https://website.com/wp-content/uploads/2013/12/x-x-x-x.pdf
    https://website.com/products/xx-x-xsx/
        https://website.com/wp-content/uploads/2012/12/x-xxxx.pdf
Since you have the code to save a link, you can wrap it into a sub for re-use:
Sub GetFile(p_sRemoteFile, p_sLocalFile)
    Dim xHttp: Set xHttp = CreateObject("Microsoft.XMLHTTP")
    Dim bStrm: Set bStrm = CreateObject("Adodb.Stream")
    ' Synchronous request (third argument False), so no wait loop is needed
    xHttp.Open "GET", p_sRemoteFile, False
    xHttp.Send
    With bStrm
        .Type = 1 '//binary
        .Open
        .Write xHttp.responseBody
        .SaveToFile p_sLocalFile, 2 '//overwrite
        .Close
    End With
End Sub
Then, you can use the InternetExplorer object to get a collection of links in a page:
Sub GetPageLinks(p_sURL)
    Dim objIE
    Dim objLinks
    Dim objLink
    Dim iCounter
    Set objIE = CreateObject("InternetExplorer.Application")
    objIE.Visible = True
    objIE.Navigate p_sURL
    ' Wait until the page has finished loading (READYSTATE_COMPLETE = 4)
    Do Until objIE.ReadyState = 4
        Wscript.Sleep 100
    Loop
    Set objLinks = objIE.Document.All.Tags("a")
    For iCounter = 1 To objLinks.Length
        Set objLink = objLinks(iCounter - 1)
        With objLink
            If StrComp(Right(.href, 3), "pdf", 1) = 0 Then
                ' Get file
                GetFile .href, "C:\temp\downloads\" & GetFileNameFromURL(.href)
            Else
                ' Process page
                ' Note: this follows every non-PDF link without limit, so restrict it
                ' (for example by URL prefix) before running it against a real site
                GetPageLinks .href
            End If
        End With
    Next
    ' Close the browser window opened for this page
    objIE.Quit
End Sub
Here's a function that extracts the file name from a URL:
Function GetFileNameFromURL(p_sURL)
Dim arrFields
arrFields = Split(p_sURL, "/")
GetFileNameFromURL = arrFields(UBound(arrFields))
End Function
This function will return xxxx.pdf given https://website.com/wp-content/uploads/2016/12/xxxx.pdf.
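Because the recursive approach follows every non-PDF link it finds, it can wander off the product pages or revisit the same page repeatedly. The following is a minimal sketch of the same crawl with two guards added: a visited set and a URL prefix filter. It is written in Python rather than VBScript; the names LinkCollector, find_pdfs, fetch and SITE are hypothetical, and the in-memory SITE dictionary stands in for real HTTP requests so that the logic can be tried without a network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_pdfs(start_url, fetch, page_prefix):
    """Return the PDF URLs reachable from start_url.

    fetch(url) must return the page's HTML as text (supplied by the caller).
    Only pages whose URL starts with page_prefix are followed.
    """
    visited, pdfs, queue = set(), [], [start_url]
    while queue:
        url = queue.pop(0)
        if url in visited:
            continue  # never process the same page twice
        visited.add(url)
        parser = LinkCollector()
        parser.feed(fetch(url))
        for href in parser.hrefs:
            link = urljoin(url, href)  # resolve relative links
            if urlsplit(link).path.lower().endswith(".pdf"):
                if link not in pdfs:
                    pdfs.append(link)
            elif link.startswith(page_prefix) and link not in visited:
                queue.append(link)
    return pdfs


# Tiny in-memory "site" (hypothetical content) standing in for real requests.
SITE = {
    "https://website.com/productpage/":
        '<a href="/products/a/">A</a><a href="/about/">About</a>',
    "https://website.com/products/a/":
        '<a href="/wp-content/uploads/2016/12/xxxx.pdf">PDF</a>'
        '<a href="/productpage/">Back</a>',
}

print(find_pdfs("https://website.com/productpage/",
                SITE.__getitem__,
                "https://website.com/products/"))
```

The same two guards could be carried over to GetPageLinks, for instance by checking Left(.href, Len(prefix)) before recursing and keeping the visited URLs in a Scripting.Dictionary; that translation is left as an assumption and is not shown here.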

extract attachments from DB to separate folders for each document

I have an assignment: to extract data from a Lotus Notes DB, including documents and their attachments. The purpose is to store it on SharePoint as a library.
So far I have managed to create a view and export its data to Excel to structure it.
I have also looked up some agent examples for extracting the attachments. By implementing the script below, I managed to export the attachments:
Dim sDir As String
Dim s As NotesSession
Dim w As NotesUIWorkspace
Dim db As NotesDatabase
Dim dc As NotesDocumentCollection
Dim doc As NotesDocument
Sub Initialize
Set s = New NotesSession
Set w = New NotesUIWorkspace
Set db = s.CurrentDatabase
Set dc = db.UnprocessedDocuments
Set doc = dc.GetFirstDocument
Dim rtItem As NotesRichTextItem
Dim RTNames List As String
Dim DOCNames List As String
Dim itemCount As Integer
Dim sDefaultFolder As String
Dim x As Integer
Dim vtDir As Variant
Dim iCount As Integer
Dim j As Integer
Dim lngExportedCount As Long
Dim attachmentObject As Variant
x = MsgBox("This action will extract all attachments From the " & CStr(dc.Count) & _
" document(s) you have selected, And place them into the folder of your choice." & _
Chr(10) & Chr(10) & "Would you like To continue?", 32 + 4, "Export Attachments")
If x <> 6 Then Exit Sub
sDefaultFolder = s.GetEnvironmentString("LPP_ExportAttachments_DefaultFolder")
If sDefaultFolder = "" Then sDefaultFolder = "F:"
vtDir = w.SaveFileDialog( False, "Export attachments To which folder?", "All files|*.*", sDefaultFolder, "Choose Folder and Click Save")
If IsEmpty(vtDir) Then Exit Sub
sDir = StrLeftBack(vtDir(0), "\")
Call s.SetEnvironmentVar("LPP_ExportAttachments_DefaultFolder", sDir)
While Not (doc Is Nothing)
iCount = 0
itemCount = 0
lngExportedCount = 0
Erase RTNames
Erase DocNames
'Scan all items in document
ForAll i In doc.Items
If i.Type = RICHTEXT Then
Set rtItem = doc.GetfirstItem(i.Name)
If Not IsEmpty(rtItem.EmbeddedObjects) Then
RTNames(itemCount) = CStr(i.Name)
itemCount = itemCount +1
End If
End If
End ForAll
For j = 0 To itemCount-1
Set rtItem = Nothing
Set rtItem = doc.GetfirstItem(RTNames(j))
ForAll Obj In rtItem.EmbeddedObjects
If ( Obj.Type = EMBED_ATTACHMENT ) Then
Call ExportAttachment(Obj)
Call doc.Save( False, True )
'creates conflict doc if conflict exists
End If
End ForAll
Next
'Scan all items in document
ForAll i In doc.Items
If i.Type = ATTACHMENT Then
DOCNames(lngExportedCount) = i.Values(0)
lngExportedCount = lngExportedCount + 1
End If
End ForAll
For j% = 0 To lngExportedCount-1
Set attachmentObject = Nothing
Set attachmentObject = doc.GetAttachment(DOCNames(j%))
Call ExportAttachment(attachmentObject)
Call doc.Save( False, True )
'creates conflict doc if conflict exists
Next
Set doc = dc.GetNextDocument(doc)
Wend
MsgBox "Export Complete.", 16, "Finished"
End Sub
Sub ExportAttachment(o As Variant)
Dim sAttachmentName As String
Dim sNum As String
Dim sTemp As String
sAttachmentName = sDir & "\" & o.Source
While Not (Dir$(sAttachmentName, 0) = "")
sNum = Right(StrLeftBack(sAttachmentName, "."), 2)
If IsNumeric(sNum) Then
sTemp = StrLeftBack(sAttachmentName, ".")
sTemp = Left(sTemp, Len(sTemp) - 2)
sAttachmentName = sTemp & Format$(CInt(sNum) + 1, "##00") & _
"." & StrRightBack(sAttachmentName, ".")
Else
sAttachmentName = StrLeftBack(sAttachmentName, ".") & _
"01." & StrRightBack(sAttachmentName, ".")
End If
Wend
Print "Exporting " & sAttachmentName
'Save the file
Call o.ExtractFile( sAttachmentName )
End Sub
So the issue I have right now is that these attachments are all saved to the same folder, which means I would have to move them manually into the right library folders (several thousand of them). Could anyone help with how I should change the above code so that the attachments are saved to a separate folder for each document in the DB?
Also, for some reason I can't figure out, the line below causes an error pop-up saying "Object Variable not set":
sAttachmentName = sDir & "\" & o.Source
Would anyone know why it causes failure and stops the whole process?
You need to use the MkDir statement to create a directory and extract the attachments into that folder. You would probably write something like:
MkDir sDir
You need to write code that creates a new directory for each document (make sure you check whether the directory exists, and preferably make sure each directory has a unique name).
I wrote a tool like that, which exports all the fields of a document into XML, as well as attachments and embedded images. It can be set to separate each document into its own directory.
You can read more about it at the link below; perhaps you can get some ideas from the description. I use the UniversalID of the document to get a unique folder name.
http://www.texasswede.com/websites/texasswede.nsf/Page/Notes%20XML%20Exporter
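The folder-per-document idea can be sketched independently of LotusScript. The following Python snippet is only an illustration of the logic (the function name folder_for_document and the sample ID are made up): it builds a folder path from a base directory plus the document's unique ID and creates the folder only if it does not exist yet.

```python
import os
import tempfile


def folder_for_document(base_dir: str, universal_id: str) -> str:
    """Create (if needed) and return a folder named after the document's ID."""
    path = os.path.join(base_dir, universal_id)
    # exist_ok=True means no error is raised when the folder is already there.
    os.makedirs(path, exist_ok=True)
    return path


# Hypothetical usage: a temporary base folder and a made-up 32-character ID.
base = tempfile.mkdtemp()
first = folder_for_document(base, "A1B2C3D4E5F60718293A4B5C6D7E8F90")
again = folder_for_document(base, "A1B2C3D4E5F60718293A4B5C6D7E8F90")
print(first == again, os.path.isdir(first))
```

In the agent itself, the equivalent would presumably be to build the target path from sDir and the document's UniversalID, test whether the folder exists with Dir$ before calling MkDir, and pass that path to the export routine; treat that mapping as an assumption to verify against your own code.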

get_included_files in classic ASP?

Is there an equivalent to PHP's get_included_files in classic ASP?
No, there is not.
A very ugly function for that:
<!--#include file="include/common.asp"-->
<%
Function GetIncludedFiles()
Dim Url
Dim Fso
Dim Fs
Dim Src
Dim Arr
Dim Ret
Dim i
Set Fso = Server.CreateObject("Scripting.FileSystemObject")
ReDim Ret(-1)
Url = Request.ServerVariables("URL")
Set Fs = Fso.OpenTextFile(Server.MapPath(Url))
Src = Fs.Readall()
Fs.Close
Set Fs = Nothing
Set Fso = Nothing
Arr = Split(Src, "<" & "!--#include file=")
For i = 0 To UBound(Arr)
Arr(i) = Left(Arr(i), InStr(Arr(i), "-->"))
Arr(i) = Replace(Arr(i), "-", "")
Arr(i) = Replace(Arr(i), "'", "")
Arr(i) = Trim(Replace(Arr(i), """", ""))
If Arr(i) <> "" Then
ReDim Preserve Ret(UBound(Ret) + 1)
Ret(UBound(Ret)) = Arr(i)
End If
Next
GetIncludedFiles = Ret
End Function
Dim File
For Each File In GetIncludedFiles()
Response.Write File & "<br />"
Next
%>
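The same idea as the function above, scanning the page source for include directives, can be expressed more compactly with a regular expression. This is a sketch in Python rather than classic ASP, with the hypothetical names INCLUDE_RE and included_files; like the function above, it only looks at the text of one file and does not follow includes nested inside the included files.

```python
import re

# Matches <!--#include file="..."--> and <!--#include virtual="..."-->,
# allowing optional whitespace and any letter case.
INCLUDE_RE = re.compile(
    r'<!--\s*#include\s+(file|virtual)\s*=\s*"([^"]+)"\s*-->',
    re.IGNORECASE,
)


def included_files(source: str):
    """Return (kind, path) pairs for each include directive found in source."""
    return [(kind.lower(), path) for kind, path in INCLUDE_RE.findall(source)]


# Hypothetical page source used only for demonstration.
page = '''<!--#include file="include/common.asp"-->
<!-- #include virtual="/include/mainfile.asp" -->
<% Response.Write "hello" %>'''

print(included_files(page))
```

For the sample page this yields one file include and one virtual include, in the order they appear in the source.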
The simple way is to create a main file in a specific directory (for example /include/mainfile.asp) and then include all the other files in this file. Something like:
<!--#include file="[your directory here/file1.asp]"-->
<!--#include file="[your directory here/file2.asp]"-->
<!--#include file="[your directory here/file3.asp]"-->
Then you can include your main file, using "virtual", in the rest of your pages that need access to those other included files.
<!--#include virtual="/include/mainfile.asp"-->
Not as such, but I vaguely remember seeing a tool or two floating around that will give you the equivalent report. It might have been on Code Project or somewhere similar... it's been a long time since I last ran across it.
