How can I extract information from a web page into an Excel sheet?
The website is https://www.proudlysa.co.za/members.php and I would like to extract all the companies listed there and all their respective information.
The process you're referring to is called web scraping, and there are several VBA tutorials out there for you to try.
I tried creating something to grab all the pages, but I ran out of time and it had bugs. This should help you a little. You will have to do this on each of the 112 pages.
Using Chrome, go to the page.
Type javascript: in the address bar, then paste the code below. It should extract what you need; then you just have to copy and paste the result into Excel.
var list = $(document).find(".pricing-list");
var csv ="";
for (i = 0; list.length > i;i++) {
var dataTags = list[i].getElementsByTagName('li');
var dataArr = [];
for (j = 0; dataTags.length > j;j++) {
dataArr.push(dataTags[j].innerText.trim());
}
csv += dataArr.join(', ') + "<br>";
}
You should get one comma-separated line per listing.
EDITED
Use this instead. It automatically downloads each page as a CSV file; you can then combine the files afterwards.
Make sure to type javascript: in the address bar before pasting and pressing Enter.
This works in Chrome; I'm not sure about other browsers, as I don't use them much.
var list = $(document).find(".pricing-list");
var csv ="data:text/csv;charset=utf-8,";
for (i = 0; list.length > i;i++) {
var dataTags = list[i].getElementsByTagName('li');
var dataArr = [];
for (j = 0; dataTags.length > j;j++) {
dataArr.push(dataTags[j].innerText.trim());
}
csv += dataArr.join(', ') + "\n";
}
var a = document.createElement("a");
a.href = ""+ encodeURI(csv);
a.download = "data.csv";
a.click();
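One caveat with joining on ', ': a field that itself contains a comma or a quote will shift the columns in Excel. A minimal sketch of quoting each field first, following the usual CSV convention (RFC 4180). The helper names csvField and csvRow and the sample row are made up for illustration; they are not part of the snippet above.

```javascript
// Quote one field: wrap it in double quotes when it contains a comma,
// a quote, or a line break, and double any embedded quotes.
function csvField(value) {
  var text = String(value);
  if (/[",\r\n]/.test(text)) {
    return '"' + text.replace(/"/g, '""') + '"';
  }
  return text;
}

// Join an array of fields into one CSV line.
function csvRow(fields) {
  return fields.map(csvField).join(",");
}

// Hypothetical sample row, not taken from the real site.
console.log(csvRow(["Acme (Pty) Ltd", "12 Main Rd, Cape Town", 'The "best" widgets']));
// → Acme (Pty) Ltd,"12 Main Rd, Cape Town","The ""best"" widgets"
```

In the loop above, this would mean using csvRow(dataArr) in place of dataArr.join(', ').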
I was thinking of writing a script which would tell me:
How often each CSS class defined in my .css file is used in my code
Redundant CSS classes - classes never used
CSS classes that are referenced but don't exist.
But I just want to make sure something like this doesn't exist already? Does it?
Thanks
Just for fun, I wrote one.
Try it.
First we need to find our style sheet. In an actual script, this would be written better, but this works on jsFiddle.
var styles = document.head.getElementsByTagName('style');
var css = styles[styles.length - 1].innerHTML;
Then remove comments and the body of each rule (i.e. the stuff between the braces). This is done because there could be a .com in a background-image property, or any number of other problems. We assume there isn't a } inside a string literal; if there is one, this step will break. [\s\S] is used rather than . so that comments spanning several lines are removed too.
var clean = css.replace(/\/\*[\s\S]*?\*\//g, '').replace(/\{[^}]*\}/g, ',');
We can find classes with regular expressions, and then count how many of them occur.
// [\w-] rather than \w so that hyphenated class names are matched whole
var re_class = /\.([\w-]+)/g;
var cssClasses = {}, match, c;
while (match = re_class.exec(clean)) {
c = match[1];
cssClasses[c] = cssClasses[c] + 1 || 1;
}
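To see what the clean-up and counting steps produce, here is a self-contained sketch that runs them on a string instead of a style element. It assumes comments may span lines and class names may contain hyphens; countCssClasses and the sample CSS are made up for illustration.

```javascript
// Count class selectors in a CSS string: strip comments, replace rule
// bodies with commas, then tally every ".name" that remains.
function countCssClasses(css) {
  var clean = css
    .replace(/\/\*[\s\S]*?\*\//g, "")   // comments, including multi-line ones
    .replace(/\{[^}]*\}/g, ",");        // rule bodies (assumes no "}" in strings)
  var re = /\.([\w-]+)/g;
  var counts = {}, match;
  while ((match = re.exec(clean)) !== null) {
    counts[match[1]] = (counts[match[1]] || 0) + 1;
  }
  return counts;
}

// Made-up sample input.
var sample = [
  "/* header",
  "   styles */",
  ".nav, .nav-item { background: url(img.example.com/a.png); }",
  "a.nav:hover { color: red; }"
].join("\n");

console.log(countCssClasses(sample));
// nav is counted twice and nav-item once; ".example" and ".com" inside the
// rule body are ignored because the body was removed first.
```

Note that rules nested inside an @media block are not handled by this simple brace matching.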
I used jsprint for displaying our findings. This shows how many times each class is mentioned in our CSS.
jsprint("css classes used", cssClasses);
Thanks to Google and this answer we can find all elements in the body, and loop through them. By default, we assume no classes were used in our HTML, and all classes used were defined.
var elements = document.body.getElementsByTagName("*");
var neverUsed = Object.keys(cssClasses);
var neverDefined = [];
var htmlClasses = {};
We get each element's class, and split it on the spaces.
for (var i=0; i<elements.length; i++) {
var e = elements[i];
var classes = (e.className || "").split(" ");
This is a triply nested loop, but it works nicely.
for (var j=0; j<classes.length; j++) {
for (var k=0; k<neverUsed.length; k++) {
We thought classes[j] was never used, but we found a use of it. Remove it from the array.
if (neverUsed[k] === classes[j]) {
neverUsed.splice(k, 1);
}
}
It looks like we found a class that doesn't appear in our CSS. We just need to make sure it's not an empty string, and then push it onto our array.
if (classes[j].length && cssClasses[classes[j]] == null) {
neverDefined.push(classes[j]);
}
Also count the number of times each class is used in HTML.
if (classes[j].length) {
htmlClasses[classes[j]] = htmlClasses[classes[j]] + 1 || 1;
}
}
}
Then display our results.
jsprint("html class usage", htmlClasses);
jsprint("never used in HTML", neverUsed);
jsprint("never defined in CSS", neverDefined);
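The comparison step can also be written without the inner loop by using Set lookups. This is only a sketch of an alternative, not the code from the fiddle; compareClassUsage and its two array parameters are hypothetical names.

```javascript
// definedClasses: array of class names found in the CSS.
// classAttributes: array of raw class attribute strings from the HTML.
function compareClassUsage(definedClasses, classAttributes) {
  var defined = new Set(definedClasses);
  var used = new Set();
  classAttributes.forEach(function (attr) {
    // Split on any run of whitespace and drop empty strings.
    attr.split(/\s+/).filter(Boolean).forEach(function (name) {
      used.add(name);
    });
  });
  return {
    neverUsed: definedClasses.filter(function (name) { return !used.has(name); }),
    neverDefined: Array.from(used).filter(function (name) { return !defined.has(name); })
  };
}

// Made-up sample data.
var result = compareClassUsage(["nav", "footer", "hidden"], ["nav  active", "", "nav footer"]);
console.log(result.neverUsed);    // → [ 'hidden' ]
console.log(result.neverDefined); // → [ 'active' ]
```

Because used is a Set, a class that appears on many elements is reported only once in neverDefined.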
I have a page that shows a javascript countdown. The javascript automatically populates "d" for days, "h" for hours, etc... CSS adds "ay(s)", "our(s)", etc..., as space allows, and capitalizes them.
Javascript:
function cdtd(broadcast) {
var nextbroadcast = new Date(broadcast);
var now = new Date();
var timeDiff = nextbroadcast.getTime() - now.getTime();
if (timeDiff <= 0) {
clearTimeout(timer);
document.getElementById("countdown").innerHTML = "<a href=\"flconlineservices.php\">Internet broadcast in progress<\/a>";
/* Run any code needed for countdown completion here */
}
var seconds = Math.floor(timeDiff / 1000);
var minutes = Math.floor(seconds / 60);
var hours = Math.floor(minutes / 60);
var days = Math.floor(hours / 24);
hours %= 24;
minutes %= 60;
seconds %= 60;
document.getElementById("daysBox").innerHTML = days + " d";
document.getElementById("hoursBox").innerHTML = hours + " h";
document.getElementById("minsBox").innerHTML = minutes + " m";
// seconds isn't in our html code (javascript error if this isn't commented out)
/*document.getElementById("secsBox").innerHTML = seconds + " s";*/
var timer = setTimeout('cdtd(broadcast)',1000);
}
CSS:
[role="navigation"] {text-transform:capitalize;}
@media screen and (min-width:1600px) {
#countdown #daysBox:after {content:"ay(s)";}
#countdown #hoursBox:after {content:"our(s)";}
#countdown #minsBox:after {content:"inute(s)";}
}
Firefox and Opera display the countdown as I expected (3 Day(s), 5 Hour(s), etc...), but Internet Explorer capitalizes the (s) (3 Day(S), 5 Hour(S), etc...). Safari and Chrome are even worse, as they capitalize the (s) and the first letter of the CSS generated content (3 DAy(S), 5 HOur(S), etc...).
I found a page that shows typography bugs with :first-letter and :first-line that may be somewhat related.
I tried doing text-transform:lowercase and then text-transform:capitalize, but that didn't change the results.
Any ideas on how to fix this? I'll probably just knock out the capitalization, but then I have to make sure everything is typed in the correct casing.
JJ
OK, so from what I can tell, you just need the first letter of days, hours and minutes capitalized. You can do this in JavaScript. Something like:
var daysString = days + " d";
document.getElementById("daysBox").innerHTML = daysString.toUpperCase();
Update: I forgot to mention that you should take out #countdown {text-transform:capitalize;}
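If you would rather build the whole label in JavaScript, so that no browser's text-transform handling is involved, a rough sketch could look like this. splitDuration and label are made-up helper names, and this version drops the "as space allows" behaviour of the media query.

```javascript
// Split a millisecond difference into whole days, hours and minutes.
function splitDuration(ms) {
  var totalMinutes = Math.floor(ms / 60000);
  return {
    days: Math.floor(totalMinutes / 1440),
    hours: Math.floor(totalMinutes / 60) % 24,
    minutes: totalMinutes % 60
  };
}

// Build e.g. "3 Day(s)": only the first letter of the unit is capitalized.
function label(value, unit) {
  return value + " " + unit.charAt(0).toUpperCase() + unit.slice(1) + "(s)";
}

// 3 days, 5 hours and 7 minutes, expressed in milliseconds.
var parts = splitDuration(((3 * 24 + 5) * 60 + 7) * 60000);
console.log(label(parts.days, "day"), label(parts.hours, "hour"), label(parts.minutes, "minute"));
// → 3 Day(s) 5 Hour(s) 7 Minute(s)
```

Each string could then be assigned to the matching box's innerHTML in place of days + " d" and so on.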
I wrote a "read more" function, but it's not working correctly. I can split my post and cut the string with Substring, using the <!--kamore--> marker as the cut point.
But when I assign the cut string to InnerHtml and there is an unclosed HTML tag before the marker (for example <p><!--kamore-->), the page layout breaks. I can't solve this. If I strip the tags with a regex instead, the whole post becomes plain text, which is no good: any links or tables in the post are no longer rendered.
Here is my little code.
#region ReadMore
string strContent = drvRow["cont"].ToString();
//strContent = Server.HtmlDecode(strContent);
//strContent = Regex.Replace(strContent, @"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>", string.Empty);
// More extension by kad1r
int kaMoreIndex;
kaMoreIndex = strContent.IndexOf("<!--kamore-->");
if (kaMoreIndex > 0)
{
if (strContent.Length >= kaMoreIndex)
{
aReadMore.Visible = true;
article.InnerHtml = strContent.Substring(0, kaMoreIndex);
// if this ends like this there is a problem
// < p>< !--kamore--> or < div>< !--kamore-->
// because there is no end of this tag!
}
else
{
article.InnerHtml = strContent;
}
}
else
{
article.InnerHtml = strContent;
}
#endregion
I fixed it. I found the code below and applied it to my string. Now everything works fine.
http://social.msdn.microsoft.com/Forums/en-US/csharpgeneral/thread/0f06a2e9-ab09-4692-890e-91a6974725c0
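For readers who only want the general idea: one way to deal with the unclosed-tag problem is to track which tags are still open at the cut point and append their closing tags. The sketch below is written in JavaScript purely for illustration and is not the code from the linked thread; truncateAtMarker and VOID_TAGS are made-up names, and the regex-based tag scan is naive (it assumes reasonably well-formed markup with no > inside attribute values).

```javascript
// Elements that never take a closing tag.
var VOID_TAGS = ["area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"];

function truncateAtMarker(html, marker) {
  var index = html.indexOf(marker);
  if (index < 0) {
    return html;                      // no marker: show the whole post
  }
  var head = html.substring(0, index);
  var open = [];                      // stack of tags still open in head
  var tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(\/?)>/g;
  var match;
  while ((match = tagPattern.exec(head)) !== null) {
    var name = match[2].toLowerCase();
    if (match[3] === "/" || VOID_TAGS.indexOf(name) !== -1) {
      continue;                       // self-closing or void element
    }
    if (match[1] === "/") {
      var at = open.lastIndexOf(name);
      if (at !== -1) {
        open.length = at;             // drop this tag and anything nested in it
      }
    } else {
      open.push(name);
    }
  }
  // Close whatever is still open, innermost first.
  while (open.length > 0) {
    head += "</" + open.pop() + ">";
  }
  return head;
}

console.log(truncateAtMarker('<div><p>Intro <a href="/x">link</a><!--kamore--> rest</p></div>', "<!--kamore-->"));
// → <div><p>Intro <a href="/x">link</a></p></div>
```

The same stack approach should port to C# with Regex.Matches and a Stack<string>; an HTML parser would be more robust than a regex for untidy markup.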
How do I detect what browser (IE, Firefox, Opera) the user is accessing my site with? Examples in Javascript, PHP, ASP, Python, JSP, and any others you can think of would be helpful. Is there a language agnostic way to get this information?
If it's for handling the request, look at the User-Agent header on the incoming request.
UPDATE: If it's for reporting, configure your web server to log the User-Agent in the access logs, then run a log analysis tool, e.g., AWStats.
UPDATE 2: FYI, it's usually (not always, usually) a bad idea to change the way you're handling a request based on the User-Agent.
Comprehensive list of User Agent Strings from various Browsers
The list is really large :)
You would take a look at the User-Agent that they are sending. Note that you can send whatever agent you want, so that's not 100% foolproof, but most people don't change it unless there's a specific reason to.
A quick and dirty Java servlet example
private String getBrowserName(HttpServletRequest request) {
// Get the User-Agent from the request header.
// Constants is the application's own class of string constants (not shown here).
String userAgent = request.getHeader(Constants.BROWSER_USER_AGENT);
// The header is optional, so guard against a missing value.
if (userAgent == null) {
return "Undefined Browser";
}
String browserName;
// Check for Internet Explorer first, then the other known browsers.
if (userAgent.indexOf("MSIE") > -1) {
browserName = Constants.BROWSER_NAME_IE;
} else if (userAgent.indexOf(Constants.BROWSER_NAME_FIREFOX) > -1) {
browserName = Constants.BROWSER_NAME_MOZILLA_FIREFOX;
} else if (userAgent.indexOf(Constants.BROWSER_NAME_OPERA) > -1) {
browserName = Constants.BROWSER_NAME_OPERA;
} else if (userAgent.indexOf(Constants.BROWSER_NAME_SAFARI) > -1) {
browserName = Constants.BROWSER_NAME_SAFARI;
} else if (userAgent.indexOf(Constants.BROWSER_NAME_NETSCAPE) > -1) {
browserName = Constants.BROWSER_NAME_NETSCAPE;
} else {
browserName = "Undefined Browser";
}
// Return the browser name.
return browserName;
}
You can use the HttpBrowserCapabilities Class in ASP.NET. Here is a sample from this link
private void Button1_Click(object sender, System.EventArgs e)
{
HttpBrowserCapabilities bc;
string s;
bc = Request.Browser;
s= "Browser Capabilities" + "\n";
s += "Type = " + bc.Type + "\n";
s += "Name = " + bc.Browser + "\n";
s += "Version = " + bc.Version + "\n";
s += "Major Version = " + bc.MajorVersion + "\n";
s += "Minor Version = " + bc.MinorVersion + "\n";
s += "Platform = " + bc.Platform + "\n";
s += "Is Beta = " + bc.Beta + "\n";
s += "Is Crawler = " + bc.Crawler + "\n";
s += "Is AOL = " + bc.AOL + "\n";
s += "Is Win16 = " + bc.Win16 + "\n";
s += "Is Win32 = " + bc.Win32 + "\n";
s += "Supports Frames = " + bc.Frames + "\n";
s += "Supports Tables = " + bc.Tables + "\n";
s += "Supports Cookies = " + bc.Cookies + "\n";
s += "Supports VB Script = " + bc.VBScript + "\n";
s += "Supports JavaScript = " + bc.JavaScript + "\n";
s += "Supports Java Applets = " + bc.JavaApplets + "\n";
s += "Supports ActiveX Controls = " + bc.ActiveXControls + "\n";
TextBox1.Text = s;
}
PHP's predefined superglobal array $_SERVER contains a key "HTTP_USER_AGENT", which contains the value of the User-Agent header as sent in the HTTP request. Remember that this is user-provided data and is not to be trusted. Few users alter their user-agent string, but it does happen from time to time.
On the client side, you can do this in JavaScript using the navigator.userAgent property. Here's a crude example:
if (navigator.userAgent.indexOf("MSIE") > -1)
{
alert("Internet Explorer!");
}
else if (navigator.userAgent.indexOf("Firefox") > -1)
{
alert("Firefox!");
}
A more detailed and comprehensive example can be found here: http://www.quirksmode.org/js/detect.html
Note that if you're doing the browser detection for the sake of Javascript compatibility, it's usually better to simply use object detection or a try/catch block, lest some version you didn't think of slip through the cracks of your script.
For example, instead of doing this...
if(navigator.userAgent.indexOf("MSIE 6") > -1)
{
objXMLHttp = new ActiveXObject("Microsoft.XMLHTTP");
}
else
{
objXMLHttp = new XMLHttpRequest();
}
...this is better:
if(window.XMLHttpRequest) // Works in Firefox, Opera, and Safari, maybe latest IE?
{
objXMLHttp = new XMLHttpRequest();
}
else if (window.ActiveXObject) // If the above fails, try the MSIE 6 method
{
objXMLHttp = new ActiveXObject("Microsoft.XMLHTTP");
}
It may depend on your setup. With Apache on Linux, it's written to the access log, /var/log/apache2/access_log.
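As a rough sketch of pulling the User-Agent back out of such a log: this assumes Apache's "combined" log format, where the User-Agent is the last quoted field on each line (whether it is logged at all depends on your LogFormat setting). The function names are made up, and user agents containing escaped quotes are not handled.

```javascript
// Return the last double-quoted field of a log line, or null if none.
function userAgentFromLogLine(line) {
  var match = /"([^"]*)"\s*$/.exec(line);
  return match ? match[1] : null;
}

// Tally how often each User-Agent string occurs.
function countUserAgents(lines) {
  var counts = {};
  lines.forEach(function (line) {
    var agent = userAgentFromLogLine(line);
    if (agent !== null) {
      counts[agent] = (counts[agent] || 0) + 1;
    }
  });
  return counts;
}

// Example line in combined log format.
var logLine = '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 ' +
              '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"';
console.log(userAgentFromLogLine(logLine));
// → Mozilla/4.08 [en] (Win98; I ;Nav)
```

For anything beyond a quick check, a log analysis tool is probably the better choice.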
You can do this by:
- looking at the web server log, OR
- looking at the User-Agent field in the HTTP request (which is a plain text stream) before processing it.
First of all, I'd like to note that it is best to avoid patching against specific web browsers, except as a last resort. Try to achieve cross-browser compatibility instead by using standards-compliant HTML/CSS/JS (yes, JavaScript does have a common-denominator subset that works across all major browsers).
With that said, the User-Agent field of the HTTP request header contains the client's (claimed) browser. This has become a real mess, though, because people code against specific browsers rather than the specification, so determining the real browser can be a little tricky.
Match against this:
User-Agent contains -> browser
Firefox -> Firefox
MSIE -> Internet Explorer
Opera -> Opera (one of the few browsers, which don't pretend to be Mozilla :) )
Agents containing the words "bot" or "crawler" are usually bots (so you can omit them from logs, etc.)
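The matching rules above could be sketched as a small function. detectBrowser is a made-up name, and the substring checks are as crude as the table itself; a spoofed or unusual User-Agent will be misreported.

```javascript
// Map a User-Agent string to a browser name using simple substring checks.
function detectBrowser(userAgent) {
  var ua = userAgent || "";
  if (/bot|crawler/i.test(ua)) {
    return "Bot";
  }
  // Opera is checked before MSIE because some Opera versions could be
  // configured to identify as Internet Explorer.
  if (ua.indexOf("Opera") !== -1) {
    return "Opera";
  }
  if (ua.indexOf("MSIE") !== -1) {
    return "Internet Explorer";
  }
  if (ua.indexOf("Firefox") !== -1) {
    return "Firefox";
  }
  return "Unknown";
}

console.log(detectBrowser("Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0"));
// → Firefox
console.log(detectBrowser("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"));
// → Bot
```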
Check out browscap.ini. The linked site has files for multiple scripting languages. The browscap data not only identifies the user agent but also has info about the browser's CSS support, JS support, OS, whether it's a mobile browser, etc.
Cruise over to this page to see an example of what info browscap.ini can tell you about your current browser.