Web Scraping by XMLHTTP - css

I would like to scrape all the job titles and company names from a job search website. However, I am unable to do so, because I believe I am not targeting the correct element in the HTML. I have researched this for days; please advise on the correct HTML element. Once I can reach the correct element, I will add the loop and finish the program. Thanks.
Website: https://www.efinancialcareers.my/search/?countryCode=MY&radius=40&radiusUnit=km&page=1&pageSize=20&currencyCode=MYR&language=en0
Option Explicit
Sub xmlhttp_scraping()
Dim XMLrequest As New MSXML2.XMLHTTP60
XMLrequest.Open "GET", "https://www.efinancialcareers.my/search/?countryCode=MY&radius=40&radiusUnit=km&page=1&pageSize=20&currencyCode=MYR&language=en0", False
XMLrequest.send
Dim iDOC As New MSHTML.HTMLDocument
iDOC.body.innerHTML = XMLrequest.responseText
'Cells(2, 2).Value = iDOC.getElementsByClassName("d-flex justify-content-between")(0).getElementsByTagName("h5")(0).getElementsByTagName("a")(0).innerText
'Cells(2, 2).Value = iDOC.getElementById("8091724").innerText
'Cells(2, 2).Value = iDOC.getElementsByClassName("search-card")(0).getElementsByClassName("d-flex justify-content-between")(0).getElementsByTagName("h5")(0).getElementsByTagName("a")(0).innerText
Range("H1").Value = "Time Updated on"
Range("I1").Value = Now
Columns.AutoFit
MsgBox "Done"
End Sub
Sample of HTML code below:

The page you are requesting builds its content with JavaScript. In your code, however, the innerHTML of iDOC contains only the static HTML returned by the server, because XMLHTTP does not run scripts.
For the page's JavaScript to run properly, you can automate IE using InternetExplorer.Application. Try searching for keywords like "Automate Internet Explorer Using VBA."
EDIT
I read your comment. The page reaches the READYSTATE_COMPLETE state before the JavaScript has generated the content. So you should wait for the content in some way (e.g. sleep for a fixed time, or poll until some element appears).
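' 64-bit Office (VBA7) requires PtrSafe on Declare statements
#If VBA7 Then
Public Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal ms As Long)
#Else
Public Declare Sub Sleep Lib "kernel32" (ByVal ms As Long)
#End If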
Sub sc2()
Dim objIE As New InternetExplorer
objIE.Visible = True
objIE.navigate "https://www.efinancialcareers.my/search/?countryCode=MY&radius=40&radiusUnit=km&page=1&pageSize=20&currencyCode=MYR&language=en0"
Do While objIE.Busy = True Or objIE.readyState < READYSTATE_COMPLETE
DoEvents
Loop
Dim htmlDoc As HTMLDocument
' Wait long enough
Sleep 10000
' ... Or wait until some element appears (some element disappears)
' Do
' Set htmlDoc = objIE.document
'
' If htmlDoc.getElementsXXX Then
' Exit Do
' End If
'
' DoEvents
' Sleep 1000
' Loop
Set htmlDoc = objIE.document
' Then you can access elements
' ... but this code also has a problem. ``.getElementsByTagName("h5")`` returns nothing. Please inspect the html.
Debug.Print htmlDoc.getElementsByClassName("d-flex justify-content-between")(0).getElementsByTagName("h5")(0).getElementsByTagName("a")(0).innerText
End Sub
Moreover, the code that accesses the elements also has a problem. Because the chain of calls doesn't match the generated HTML, .getElementsByTagName("h5") returns nothing. Please inspect the generated HTML in Chrome's developer tools or in the VBE's Watch window.
== $0 is not related to your problem. It simply means the active DOM element in the developer tool. (What does == $0 mean in the DOM view in developer tools?)
By the way, more and more sites are dropping support for IE. Using the InternetExplorer object is convenient, but automating Chrome or Firefox with Selenium is a better approach.
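The "wait until some element appears" idea from the commented-out loop in the code above does not depend on VBA. As a minimal sketch in Python (the helper name wait_until and its timeout and interval parameters are my own and not part of any browser automation library), polling with a timeout could look like this:

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        # Pause between checks instead of spinning the CPU.
        time.sleep(interval)
```

In the VBA version, the predicate would be a check such as "the collection returned for the job card class name has at least one item", with DoEvents and Sleep playing the role of time.sleep. Compared with a fixed Sleep 10000, this returns as soon as the content exists and fails loudly if it never shows up.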

Related

Access - Application Echo not working

I want to stop my form from flickering while VBA code executes, but Application.Echo is not working.
I have this code in the combo box's AfterUpdate event:
Private Sub Combo11_AfterUpdate()
'Stop flickering
Application.Echo False
On Error Resume Next
'If User deletes Combo Item, then delete record
If IsNull(Combo11) Then
DoCmd.SetWarnings False
DoCmd.RunCommand acCmdDeleteRecord
End If
'Call code for re-positioning controls on form
Call MovingAllControls
'Close and reopen form
Call ReOpen
Application.Echo True
End Sub
Called procedure (which is probably cause of flickering):
Sub MovingAllControls
'Refresh
DoCmd.Requery
Const MaxRecs As Integer = 10
Dim NumRecs As Integer
On Error Resume Next
'Find last record in subform, then expand Detail section according to number of records
With Forms![MyForm]![MySubform].Form
.Recordset.MoveLast
NumRecs = .RecordsetClone.RecordCount
If NumRecs > MaxRecs Then NumRecs = MaxRecs
.InsideHeight = NumRecs * .Section(0).Height + 350
End With
'Move all controls under the subform (only one in this example, but in reality I have plenty of controls to move on the form)
Forms![MyForm]![FieldName].Top = Forms![MyForm]![Myubform].Top + Forms![MyForm]![MySubform].Form.InsideHeight + 1100
End Sub
Another procedure that is called from Combobox_After_Update event:
Sub ReOpen()
'I reopen the form because this is the only way my subform controls move as they should (dynamically)
DoCmd.Close acForm, "MyForm"
DoCmd.OpenForm "MyForm"
End Sub
I also tried to see what error is raised in the AfterUpdate event, and I get error "424 - Object required", but my code still executes; the only problem is the flickering of controls. Is there any other way to stop the flickering, or what is wrong with my code?
Thanks for your help!
Rather than Application.Echo False, try the Form.Painting property:
' code under form
Me.Painting = False
' do actions that cause flicker
Me.Painting = True
Me.Repaint
Also, looking at your code here:
DoCmd.Close acForm, "MyForm"
DoCmd.OpenForm "MyForm"
I would say this is bad practice:
This can cause performance problems for even moderately large record sets.
If you use events like Form_Open or Form_Close, this can cause bugs by re-running code that was intended to only run once, when the form was first opened.
There is always a way in Access to get the results you want without closing/reopening the form.

ASP.Net treeview truncates node text

I have a TreeView on my ASP.NET page, and for some reason the text on some nodes gets cut off. I am adding all the nodes programmatically and am aware of the existing issue (http://support.microsoft.com/?scid=kb%3Ben-us%3B937215&x=8&y=13). However, I am not changing the font, and as you can see in the code below, this fix does not work for me.
Private Sub populateTreeView()
'Code that gets the data is here
Dim ParentIds As List(Of Integer) = New List(Of Integer)
For Each row As DataRow In ds.Rows
If ParentIds.Contains(row("ParentID")) Then
'' Do Nothing
Else
ParentIds.Add(row("ParentID"))
End If
Next
For Each Parent As Integer In ParentIds
Dim parentNode As New System.Web.UI.WebControls.TreeNode
For Each child In ds.Rows
If (child("ParentID") = Parent) Then
Dim childNode As New System.Web.UI.WebControls.TreeNode
parentNode.Text = child("ParentDescription")
parentNode.Value = child("ParentID")
parentNode.Expanded = False
childNode.Text = child("ChildDescription")
childNode.Value = child("ChildID")
parentNode.SelectAction = TreeNodeSelectAction.None
parentNode.ChildNodes.Add(childNode)
End If
Next
trvItem.Nodes.Add(parentNode)
Next
'This is just added to test the MS fix
trvItem.Nodes(0).Text += String.Empty
End Sub
The strange thing is that this issue only appears in IE; I have tested it in Chrome and Firefox, and both browsers display the text perfectly.
When I select a node this fixes the problem and all text displays as normal.
Any ideas as to what is going wrong here would be great as I'm clueless right now.
Thanks
tv1.LabelEdit = True
tv1.Nodes(0).Nodes(0).BeginEdit()
tv1.Nodes(0).Nodes(0).NodeFont = oNewFont
tv1.Nodes(0).Nodes(0).EndEdit(False)
tv1.LabelEdit = False
Marking this as closed since I never received an answer that solved the problem.
I managed to work around it by using a JavaScript postback to select one of the items on load, which forces the text to display correctly. I think this is an extension of the bug I linked in my original question.

ASP.NET redirect not working with invalid URL with UIProcess application block - looking for redirect architecture ideas?

In my global.asax.vb file, I have code to re-write the URL if there is a prefix on the URL. We are introducing a new context in our application. So every page will either be of context hair or saliva.
Before the ASP.NET code (stack) even reaches this Global code, it calls an application block called UIProcess. It's code that Microsoft wrote years ago, and is no longer supported. The UIP block sort of mimics MVC, and you store all views, navigation and controller details inside the web.config. The UIP block is doing a redirect as shown below. Note, they had a known bug that was never fixed (commented out), so I had to recompile it before upgrading from .NET 2.0 to .NET 3.5. That's what I have commented out. That's the only bug I'm aware of.
private void RedirectToNextView(string previousView, ViewSettings viewSettings)
{
try
{
//if (previousView == null)
// HttpContext.Current.Response.Redirect(HttpContext.Current.Request.ApplicationPath + "/" + viewSettings.Type, true);
//else
// HttpContext.Current.Response.Redirect(HttpContext.Current.Request.ApplicationPath + "/" + viewSettings.Type, false);
if (previousView == null)
HttpContext.Current.Response.Redirect(HttpContext.Current.Request.ApplicationPath.TrimEnd('/') + viewSettings.Type, true);
else
HttpContext.Current.Response.Redirect(HttpContext.Current.Request.ApplicationPath.TrimEnd('/') + viewSettings.Type, false);
}
catch (System.Threading.ThreadAbortException) { }
}
Here is the Global.asax.vb code:
(again this code doesn't matter right now because it's not getting here YET with the exception being thrown)
Sub Application_BeginRequest(ByVal sender As Object, _
ByVal e As EventArgs)
' Fires at the beginning of each request
Dim originalUri As Uri = Request.Url
Dim rewrittenUrl As String = String.Empty
'Rewrite Saliva and Hair Testing urls
Select Case True
Case originalUri.AbsolutePath.StartsWith("/HairTest/")
rewrittenUrl = originalUri.AbsolutePath.Remove(0, 9)
If Not originalUri.Query.Contains("SampleTypeContext=247") Then
rewrittenUrl += "?sampleTypeContext=247"
End If
Case originalUri.AbsolutePath.StartsWith("/SalivaTest/")
rewrittenUrl = originalUri.AbsolutePath.Remove(0, 11)
If Not originalUri.Query.Contains("SampleTypeContext=3301") Then
rewrittenUrl += "?sampleTypeContext=3301"
End If
End Select
If rewrittenUrl <> String.Empty Then
'append the original query if there was one specified
If originalUri.Query <> String.Empty Then
If rewrittenUrl.Contains("?") Then
rewrittenUrl += "&"
Else
rewrittenUrl += "?"
End If
rewrittenUrl += originalUri.Query.Remove(0, 1)
End If
Context.RewritePath(rewrittenUrl)
End If
End Sub
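To make the intended mapping concrete, here is a rough, language-neutral sketch in Python of what the Application_BeginRequest handler above computes. The function name rewrite_url and the dictionary are illustrative only; the prefixes and context values are copied from the Select Case.

```python
# Prefix -> sample type context, copied from the Select Case in the handler above.
CONTEXT_BY_PREFIX = {"/HairTest/": "247", "/SalivaTest/": "3301"}


def rewrite_url(path, query=""):
    """Return the rewritten URL, or None when no prefix matches.

    `query` is the original query string without its leading "?".
    """
    for prefix, context in CONTEXT_BY_PREFIX.items():
        if path.startswith(prefix):
            # Drop the prefix but keep one leading "/".
            rewritten = path[len(prefix) - 1:]
            params = []
            # Case-sensitive check, like String.Contains in the original.
            if "SampleTypeContext=" + context not in query:
                params.append("sampleTypeContext=" + context)
            # Append the original query if there was one.
            if query:
                params.append(query)
            if params:
                rewritten += "?" + "&".join(params)
            return rewritten
    return None
```

For example, rewrite_url("/HairTest/Orders/List.aspx", "page=2") gives "/Orders/List.aspx?sampleTypeContext=247&page=2". One thing the sketch makes visible: the check looks for "SampleTypeContext" with a capital S while the appended parameter starts with a lowercase s, so a request that already carries the lowercase form would get the parameter a second time.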
The application is actually causing an exception above, when I try to pre-pend my URL (viewSettings.Type variable above) with "/HairTest" or "/SalivaTest". It causes that System.Threading.ThreadAbortException. I'm thinking because that path doesn't actually exist in our web application, but I'm just guessing. Notice, we're doing a re-write in our global, not a redirect. Our re-write prepends the URL with "/HairTest" or "/SalivaTest".
All of the pages in our web application expect that "SampleTypeContext" parameter if it needs it. If you can think of a way that will work better for this situation, let me know. I'll try to get more details on the exception.
Looking for ideas!! Our architecture approach is still up for discussion if we run into issues with this UIProcess block. We can't just get rid of the UIP block since it's used throughout our application, but I can modify the code above (in my first code snippet) if we need to.
There's no standard way to do this with the UIP block, and Microsoft doesn't maintain it anymore. I overloaded a number of methods in the library so that we can append a query string parameter (applicationScope) to each of our UIP redirect calls.
If you need the code, add a comment here, and I'll send you the source.

No session share and avoid navigation buttons in browser while opening application window

We are using VBScript code to open the application window to avoid users having forward/back navigation while opening the IE8 window.
This is the code used.
Set WshShell = CreateObject("shell.application")
Set IE=CreateObject("InternetExplorer.Application")
IE.menubar = 1
IE.toolbar = 0
IE.statusbar = 0
'here we open the application url
IE.navigate "http://www.google.com"
IE.visible = 1
WshShell.AppActivate(IE)
This works fine; however, if the user opens multiple windows, the session cookies are shared across the windows.
There is also a solution for this: we can use the -nomerge option when opening IE.
WshShell.ShellExecute "iexplore.exe", " -nomerge http://www.google.com", null, null, 1
Now we want both behaviors at once, i.e. the user should not be able to navigate forward/backward, and if two windows are opened, session data should not be shared between them.
We have not been able to get both working together.
We also do not want any full-screen mode (i.e. the mode entered by pressing F11).
Can anyone provide a solution?
Thanks in advance.
See the answers to this very closely related question: Launch Internet Explorer 7 in a new process without toolbars. There are a couple of workable options, one using PowerShell and one using some kludgy VBScript.
From what I understand, cookies are set by instance. Multiple browser windows are still going to be the same instance.
You might be able to pass in a sort of ID parameter that the program tracks, but the browser doesn't. That way regardless of how the program runs, it will have its own 'session' ID.
I think you can do this with JavaScript, reading the value from an ASP.NET hidden field. This might give you the uniqueness you're looking for.
<asp:HiddenField ID="HiddenFieldSessionID" runat="server" />
protected void Page_Load(object sender, EventArgs e)
{
HiddenFieldSessionID.Value = Session.SessionID;
}
<script type="text/javascript">
function ShowSessionID()
{
var Hidden;
Hidden = document.getElementById("HiddenFieldSessionID");
document.write(Hidden.value);
}
</script>
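The "ID parameter that the program tracks, but the browser doesn't" idea can be sketched independently of ASP.NET. The following Python sketch is only an illustration under my own assumptions: the class name WindowSessions, its methods, and the use of a random token are hypothetical, not something from the answer above. Each window gets its own token, and server-side state is keyed by that token rather than by the shared session cookie.

```python
import secrets


class WindowSessions:
    """Keeps separate state per browser window, keyed by a per-window token."""

    def __init__(self):
        self._state_by_window = {}

    def open_window(self):
        # Generate an unguessable token; it would travel in the URL or a hidden field.
        window_id = secrets.token_urlsafe(16)
        self._state_by_window[window_id] = {}
        return window_id

    def state_for(self, window_id):
        # Raises KeyError for an unknown token, so stale windows are rejected.
        return self._state_by_window[window_id]
```

Two windows opened by the same user then receive different tokens, so data stored through state_for for one window is not visible to the other even though the browser shares its cookies between them.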
The solution in the question linked by patmortech is not perfect, as the cookies were still shared. So I used the -nomerge option in the AppToRun variable, which creates two processes when the user opens the application twice on a single machine.
In IE8, if two Internet Explorer windows are opened, they are merged into a single process; the -nomerge option opens the IE8 instances in different processes instead.
On Error Resume Next
AppURL = "http://www.stackoverflow.com"
AppToRun = "iexplore -nomerge"
AboutBlankTitle = "Blank Page"
LoadingMessage = "Loading stackoverflow..."
ErrorMessage = "An error occurred while loading stackoverflow. Please close the Internet Explorer with Blank Page and try again. If the problem continues please contact IT."
EmptyTitle = ""
'Launch Internet Explorer in a separate process as a minimized window so we don't see the toolbars disappearing
dim WshShell
set WshShell = WScript.CreateObject("WScript.Shell")
WshShell.Run AppToRun, 6
dim objShell
dim objShellWindows
set objShell = CreateObject("Shell.Application")
set objShellWindows = objShell.Windows
dim ieStarted
ieStarted = false
dim ieError
ieError = false
dim seconds
seconds = 0
while (not ieStarted) and (not ieError) and (seconds < 30)
if (not objShellWindows is nothing) then
dim objIE
dim IE
'For each IE object
for each objIE in objShellWindows
if (not objIE is nothing) then
if isObject(objIE.Document) then
set IE = objIE.Document
'For each IE object that isn't an activex control
if VarType(IE) = 8 then
if IE.title = EmptyTitle then
if Err.Number = 0 then
IE.Write LoadingMessage
objIE.ToolBar = 0
objIE.StatusBar = 1
objIE.Navigate2 AppURL
ieStarted = true
else
'To see the full error comment out On Error Resume Next on line 1
MsgBox ErrorMessage
Err.Clear
ieError = true
Exit For
end if
end if
end if
end if
end if
set IE = nothing
set objIE = nothing
Next
end if
WScript.sleep 1000
seconds = seconds + 1
wend
set objShellWindows = nothing
set objShell = nothing
'Activate the IE window and restore it
success = WshShell.AppActivate(AboutBlankTitle)
if success then
WshShell.sendkeys "% r" 'restore
end if

Listing a folder structure in Classic ASP

I've developed a secure page in ASP for the company I work for. There is a landing (login page) that once you are authenticated you are taken to a page that has links to several sub pages. Each sub page has a folder structure. For example: There is a heading for Meeting Minutes and then underneath and indented are links referencing PDFs that contain the information. There may be 3 or 4 headings with documents linked beneath.
The original version had a PHP script that ran and would sync up the live site on the server from a folder structure that would be mimicked onto the live site. So if I had a folder called Folder1 and sub folders named test1 test2 test3.. the live site would display them accordingly. Since the site is now in ASP and not PHP.. the PHP script no longer works (since PHP doesn't play well with ASP).
I found a snippet online that somewhat works for what I'm trying to achieve (i.e. a Folder/Subfolder/File Name structure); however, I'm stuck on how to link the files so they open when clicked. I keep seeing a %25 in the file name. I know %20 represents a space, and since I am dealing with file and folder names that contain spaces, this appears to be my issue. I've tried adding a %20, but the spaces become "%2520".
If you look at the code below, there is a link towards the bottom that calls "MapURL". I have that link commented out at the moment as I was trying to figure out where the %25 was coming from.
Anyone have any thoughts on how to get the links to work?
Here is the snippet.
dim path
path = "PATH TO THE FOLDER ON THE SERVER"
ListFolderContents(path)
sub ListFolderContents(path)
dim fs, folder, file, item, url
set fs = CreateObject("Scripting.FileSystemObject")
set folder = fs.GetFolder(path)
'Display the target folder and info.
Response.Write("<ul><b>" & folder.Name & "</b>") '- " _
' & folder.Files.Count & " files, ")
'if folder.SubFolders.Count > 0 then
' Response.Write(folder.SubFolders.Count & " directories, ")
'end if
'Response.Write(Round(folder.Size / 1024) & " KB total." _
' & "</ul>" & vbCrLf)
Response.Write("<ul>" & vbCrLf)
'Display a list of sub folders.
for each item in folder.SubFolders
ListFolderContents(item)
next
'Display a list of files.
for each item in folder.Files
'url = MapURL(item.path)
'Response.Write("<li>" & item.Name & " - " _
Response.Write("<li>" & item.Name & " - " _
& item.Name & "</a>" _
& "</li>" & vbCrLf)
next
Response.Write("</ul>" & vbCrLf)
Response.Write("</ul>" & vbCrLf)
end sub
function MapURL(path)
dim rootPath, url
'Convert a physical file path to a URL for hypertext links.
rootPath = Server.MapPath("/")
url = Right(path, Len(path) - Len(rootPath))
MapURL = Replace(url, "\", "/")
end function
There are several things wrong with your code.
First and foremost, you do not encode the values you output at all. This is a big mistake. You are missing URL-encoding for things that go into the HREF attribute, and you miss HTML-encoding for everything else.
Next, you create a new FileSystemObject with every call to the recursive ListFolderContents() function. This is unnecessarily wasteful and will become slow as soon as there are more than a handful of files to be output.
Your recursive function should take a Folder object as the first argument, not a path. This makes things a lot easier.
The HTML structure you output is invalid. <b> cannot legally be a child of <ul>.
I completely rewrote your code to produce more correct output and to be as fast as possible. Crucial to your problem is the PathEncode() function, which transforms a relative path into a properly encoded URL. The other things should be pretty self-explanatory:
ListFolder "P:\ATH\TO\THE\FOLDER\ON\THE\SERVER"
' -- Main Functions ----------------------------------------------------
Sub ListFolder(path)
Dim fs, rootPath
Set fs = CreateObject("Scripting.FileSystemObject")
rootPath = Replace(path, Server.MapPath("/"), "") & "\"
ListFolderContents fs.GetFolder(path), PathEncode(rootPath)
End Sub
' ----------------------------------------------------------------------
Sub ListFolderContents(folder, relativePath)
Dim child
Say "<ul>"
Say "<li><div class=""folder"">" & h(folder.Name) & "</div>"
For Each child In folder.SubFolders
If Not IsHidden(child) Then
ListFolderContents child, relativePath & PathEncode(child.Name) & "/"
End If
Next
relativePath = h(relativePath)
For Each child In folder.Files
If Not IsHidden(child) Then
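' Link each file; relativePath is already URL-encoded and HTML-encoded at this point
Say "<li><a href=""" & relativePath & h(PathEncode(child.Name)) & """>" & h(child.Name) & "</a></li>"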
End If
Next
Say "</ul>"
End Sub
' -- Helper Functions / Shorthands ---------------------------------------
Sub Say(s)
Response.Write s & vbNewLine
End Sub
Function h(s)
h = Server.HTMLEncode(s)
End Function
Function PathEncode(s)
' this creates a more correct variant of what Server.URLEncode would do
PathEncode = Replace(s, "\", "/")
PathEncode = Server.URLEncode(PathEncode)
PathEncode = Replace(PathEncode, "+", "%20")
PathEncode = Replace(PathEncode, "%2F", "/")
PathEncode = Replace(PathEncode, "%2E", ".")
PathEncode = Replace(PathEncode, "%5F", "_")
End Function
Function IsHidden(File)
' Parentheses are required: in VBScript "=" is evaluated before "And"
IsHidden = (File.Attributes And 2) = 2
End Function
Notes
Use the <div class="folder"> to apply CSS styles (i.e. bold etc.) to the folder name.
The function will not output hidden files or directories.
The relativePath argument is used to keep the workload as low as possible - when a folder has 1000 files, it makes no sense to calculate the entire relative path 1000 times. With the help of this parameter, only the part that actually changes is processed.
Having functions like Say() or h() around can save you a lot of typing and it keeps the code more clean, too.
You should read up on URL-encoding (and HTML-encoding as well). Seems like you've never come across these things, which is especially bad if your task is to build a secure site.
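To illustrate the two kinds of encoding, and where a "%2520" comes from, here is a small sketch in Python rather than VBScript. The function name file_link and the sample paths are made up for the example; quote and html.escape are standard-library functions playing roughly the roles of PathEncode() and h() above.

```python
import html
from urllib.parse import quote


def file_link(relative_dir, file_name):
    """Build an <a> tag for a file, encoding each part for its context."""
    # URL-encode the path for the href (quote leaves "/" untouched by default).
    url = quote(relative_dir.replace("\\", "/") + file_name)
    # HTML-encode both the attribute value and the visible text.
    return '<a href="{}">{}</a>'.format(html.escape(url, quote=True), html.escape(file_name))


once = quote("Meeting Minutes")   # "Meeting%20Minutes"
twice = quote(once)               # "Meeting%2520Minutes": the "%" itself became "%25"
```

Encoding an already encoded name turns every "%" into "%25", which is how a space ends up as "%2520". The fix is to encode the raw name exactly once, for the context it is written into.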
You probably need extra quotes around the href value (""). The best way to check is to view the generated source of the resulting page. For example: <a href=""" & replace(...) & """>
Basically, a single quote character just closes the VBScript string, so the output is missing the HTML quote needed after href= and the closing one.
