Script or Library to find contact means on a website - information-retrieval

Does anyone know a script/recipe/library to find most relevant contact information on a website?
Some possible cases:
Find contact phone number on a personal web page
Find owner email address on a blog
Find url of the contact page

Check out WSO2's Mashup Server. You can run it on your local machine and follow the tutorial for scraping. You could pass the dynamic parameters you need into the <http> element of the scraper to loop through multiple sites running the same scrape, then push everything to a collection source (an AJAX application for capturing the information, or storage inside the WSO2 server). You can write very complex search patterns using XPath and XSLT to capture the information you want.
I don't have enough information about the specific sites you are scraping to help with the script, but any way you go, it's going to take a lot of trial and error until you get the result you are looking for.
Happy scraping!

I'm not aware of any libraries that do this.
Hm, I would use regular expressions to match phone numbers and email addresses, combined with a web spider that walks the site, and then a method for ranking the contact information.
Typically contact information will also be paired with one of a few common labels such as "Support", "Support email", "Sales", etc. There are probably a dozen or so variants of these labels that will cover 95% of all sites in English.
So, basically, I would start by building a simple recursive web spider that walks all the publicly accessible pages in a given domain and parses the HTML for email addresses and phone numbers. I would collect those into a list and then rank them based on whether or not they appear near any of the common labels.
It won't be perfect, but then again, that's part of the value of the algorithm - making it smarter, and tweaking it over time until it gets better.
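A minimal sketch of the extract-and-rank step described above, assuming the page text has already been fetched (the spider itself is left out). The patterns, the `LABELS` list, the `window` size and the function name `find_contacts` are illustrative assumptions, not a tested recipe:

```python
import re

# Deliberately loose patterns; real-world emails and phone numbers vary widely.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")

# Hypothetical label list; extend it as you see more sites.
LABELS = ("contact", "support", "sales", "email", "phone", "call us")


def find_contacts(text, window=40):
    """Return (score, kind, value) tuples, best-scoring first."""
    lowered = text.lower()
    results = []
    for kind, pattern in (("email", EMAIL_RE), ("phone", PHONE_RE)):
        for match in pattern.finditer(text):
            # Inspect the text just before and just after the match for labels.
            start = max(0, match.start() - window)
            before = lowered[start:match.start()]
            after = lowered[match.end():match.end() + window]
            score = sum(1 for label in LABELS if label in before + after)
            results.append((score, kind, match.group().strip()))
    # Highest score first; the stable sort keeps discovery order for ties.
    results.sort(key=lambda item: -item[0])
    return results


page_text = (
    "Written by bob@example.org last year. " + "x " * 50 +
    "Support email: help@example.com or call us at +1 (555) 123-4567."
)
ranked = find_contacts(page_text)
```

Counting nearby labels is only one possible ranking signal; weighting by page (for example, a page whose URL contains "contact") would be a natural next tweak.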

Related

Available Microsoft Cognitive regions

I'm looking for a voice authentication API, and I found Microsoft's.
When looking at prices, it asks you for a region. The problem is that it only shows one region.
I've been reading about Azure's regions, and it says that the region is where data is stored, so my question is whether it would be possible to use it in a different region than the one offered.
Thanks (and sorry for my spelling mistakes).
Quick Answer:
Normally yes, but currently the Speaker Recognition API is only offered out of the WestUS datacenter.
If it's mandatory that you have low-latency when using the service, I suggest you look into setting up and/or temporarily subscribing to a CDN service. Or, if you have a lot of time on your hands, and know waaaaay more than I do about this subject, you may be able to design a local cache to mitigate latency if you're distant from WestUS.
Less-Quick Answer:
First off, you should use the dashboard interface at https://portal.azure.com to sign up. You will first need to create a Pay-As-You-Go subscription as your payment-medium, but it will give you much more control over & visibility into your service.
Here's what the signup pane looks like inside of https://portal.azure.com:
It appears that, in its current "PREVIEW" deployment, you are right: the service is only offered from the WestUS data center. Normally you will have the option to choose one of tens of global datacenters, but it is common that PREVIEW services aren't deployed globally until they're out of PREVIEW status.
If the problem you are looking to remediate is latency-based, look into the CDN suggestion in my "Quick Answer."
If your issue is about getting different pricing based on your location, the location of the datacenter you choose will not affect this. If geographic-discounting applies to you, it is based on the country that is assigned to your Microsoft Username/Password combination at the time it was created. This value cannot be changed once a username/password combo has been created, and consequently, any payment info used along with this uname/pass will need to have a billing address in the same country.

Is there a good way to link registered users' emails with data in google analytics?

If I build a website for my new awesome mobile app (or web service or whatever) I might want to do a slow launch, sending email invites to the first x people to register on the site.
Is there a good way to link each registered email to the corresponding data in google analytics (or any similar service), and query them based on location, language, etc.?
Maybe the Spanish version isn't quite done yet, so I don't want to invite people who used a Spanish browser to sign up. Or maybe my app is location-dependent (like time tables for buses) and just doesn't work at all outside of my home town.
I really want to have a simple email-only "registration".
It is completely possible, although it may breach some of GA's terms of use if done wrong.
You should not store email addresses in any way as part of your GA data because it would be considered personally identifiable data. However, there is nothing saying that you couldn't store a kind of GUID for each user, and then compare that with email addresses offline - although the user should be made aware that any actions they take while using your service/application/whatever are being tracked with the capability of being personally identified.
As far as getting the actual data that you are discussing, language and location are stored by GA by default, so no headache there!
The best way to store the user's GUID would probably be in a custom dimension. How you do this is going to depend on how you build your product. I had to write a tracking library using the Measurement Protocol for an AS3 project a while back because there isn't a supported AS3 library anymore. If you are using JavaScript, it will be much easier, as Google offers native JS libraries to handle web analytics.
Finally, try taking a look at the documentation. It's pretty easy to understand.
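To make the GUID idea concrete, here is a rough sketch that keeps the email-to-ID mapping offline and builds a Measurement Protocol style pageview payload carrying only the opaque ID. The parameter names (`v`, `tid`, `cid`, `t`, `dp`, `cd1`) follow the Universal Analytics Measurement Protocol as I remember it, so check them against the current documentation. The tracking ID, the choice of custom dimension index 1 and the helper names are assumptions:

```python
import uuid
from urllib.parse import urlencode

# Offline lookup kept outside of GA: opaque ID -> email address.
user_ids = {}


def register(email):
    """Create an opaque ID for a newly registered email address."""
    user_id = str(uuid.uuid4())
    user_ids[user_id] = email
    return user_id


def build_hit(tracking_id, client_id, user_id, page):
    """Build a URL-encoded pageview hit; the email itself is never included."""
    return urlencode({
        "v": "1",            # protocol version
        "tid": tracking_id,  # property ID (placeholder in the usage below)
        "cid": client_id,    # anonymous client ID
        "t": "pageview",     # hit type
        "dp": page,          # document path
        "cd1": user_id,      # assumes custom dimension 1 is set up for this ID
    })


uid = register("alice@example.com")
body = build_hit("UA-XXXXX-Y", "555", uid, "/signup")
```

The resulting `body` would be sent as the POST body of a hit; to decide whom to invite, you would export the GA data by that custom dimension and join it against `user_ids` on your own side.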

How to send an anonymous email through Wordpress?

I have a client who has a crimestoppers' website. They want to provide visitors a means to submit anonymous crime tips, which would then be forwarded to a pre-established email address at the local police department.
What is the best / easiest way to accomplish this? The sender's IP address needs to be hidden. My client also needs to be able to pull reports showing how many tips were submitted and forwarded.
Many thanks!
A simple contact form can be used. Hiding the IP comes down to trusting the developer: the submitter cannot see what the PHP code does on the server.
You can then update a database with the tips being posted before sending the mails.
In terms of development, you can use a plugin such as Contact Form 7 and then use its hooks to save the tips submitted before sending the mails.
It is rather simple to set up a contact form that submits to an email address (just use the excellent Contact Form 7, as rrikesh's answer suggests). However, getting anonymity right (especially against a party with that much power and resources) is tricky. You need to be clear about the level of anonymity that you can provide. Log files, document metadata or your ISP can easily give a lot of information away.
Here are two projects that take different approaches. Neither is a ready-made solution to your question, but both are still relevant:
PrivacyBox:
This is a web service run by the German Privacy Foundation. It's basically a message relay like the one you want, except that the user has to trust the Foundation, not you. This model highly depends on the institution providing this service. I'm sure there are other, US-based services like this.
Briefkasten:
An open source software tool used by the German newspaper Die ZEIT.
"a reasonably secure web application for submitting content anonymously. It allows to upload attachments which are then sanitized of a number of meta-data which could compromise the submitters identity. Next, the sanitized files are encrypted via GPG and sent via email to a pre-configured list of recipients. The original (potentially 'dirty') files are then deleted from the file system of the server. Thus, neither should admins with access to the server be able to access any submissions, nor should any of the recipients have access to the unsanitized raw material."
This is an attempt to automate the crucial steps to strip any identifying data from the submission and encrypt it, so only the intended recipients can access it.
You would have to host this yourself, though. And it's a Python app.

Will Google block my access if I use their features without token?

I'm using this link https://www.google.com/reader/api/0/stream/contents/feed/FEEDHERE?output=json&n=20
to fetch feeds using Google's algorithm. As you can see I'm not adding any other parameters, just fetching the returned data in JSON format. My app will be heavily used hopefully and if I send a lot of requests to this link, will Google block my access or something?
Is there anything I can include, like userip, url for my app (so if they have problem to just contact me) or something else?
The most basic answer to your question is that Google will change its Terms of Service whenever it likes, and you've got no say in the matter. So if it's allowed today, it might not be allowed tomorrow, at Google's whim.
On this issue, though, you seem fairly safe. From the Terms of Service (this is the general document, since Reader doesn't seem to have a specific one):
Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide.
Google provides RSS and Atom. They provide these feeds, so I assume they expect that they'll be used. They don't say that it's a misuse to point someone else at those feeds, so it looks OK for now, but they could add such a clause at any time.
All online services are subject to the terms and conditions of the providers of those services. So, as others have said, they may be OK with your use today, but they can change their mind at any time down the line. I doubt including a URL, email or other contact info will help, because when these services change, the providers don't notify every user; they just announce the change publicly. They usually give several months' notice so that users have a chance to adapt their applications, but this is not standardized or enforced, so there is no guarantee. One example would be the fairly recent discontinuance of the Google Finance API (for which no replacement has been announced).
The safest approach would be to design your app so that the feature that uses Google's functionality is decoupled as much as possible from the rest of your app. Then, when or if the availability of the service changes (i.e. it's no longer available at all), you can adapt your app to use some other source for the feeds with minimal impact on the rest of the app. Design for change and plan for the worst.
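One way to picture that decoupling is to hide the feed provider behind a small interface, so the rest of the app never talks to Google directly. This is only a sketch: `FeedSource`, `FeedService` and the two stand-in sources are hypothetical names, and a real implementation would wrap the actual HTTP call in a class of its own (requires Python 3.9+ for the `list[dict]` annotation):

```python
from typing import Protocol


class FeedSource(Protocol):
    """Anything that can return feed items for a feed URL."""

    def fetch(self, feed_url: str, limit: int) -> list[dict]: ...


class UnavailableSource:
    """Stand-in for a provider that has been shut down or is blocking us."""

    def fetch(self, feed_url, limit):
        raise RuntimeError("service no longer available")


class StaticFeedSource:
    """Stand-in for an alternative provider; returns canned items."""

    def __init__(self, items):
        self._items = items

    def fetch(self, feed_url, limit):
        return self._items[:limit]


class FeedService:
    """The only feed-related class the rest of the app depends on."""

    def __init__(self, sources):
        self._sources = list(sources)

    def latest(self, feed_url, limit=20):
        # Try each configured source in order; fall through on any failure.
        for source in self._sources:
            try:
                return source.fetch(feed_url, limit)
            except Exception:
                continue
        return []


service = FeedService([
    UnavailableSource(),
    StaticFeedSource([{"title": "a"}, {"title": "b"}]),
])
items = service.latest("http://example.com/feed", limit=1)
```

If the primary provider disappears, only the list of sources passed to `FeedService` has to change.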

How to log and analyze certain user actions on my website

I have a simple page that provides a search experience. It allows users to search via a query form, filter results, and perform more in-depth searches based on the results of the first search.
I would like to get some metrics around the user experience and how they are using the page. Most of the user actions translate into a new query string. For example:
how many users perform a search and then follow up with another search / filter
how many times a wildcard is used in the search query
how many results does a user browse before a new search
I am also restricted from using Google Analytics and the like because of copyright issues (maybe I can make a case for it, or for open web analytics or something similar, if that is really the way to go). On the server side, I am thinking of using cookies to track users and log4net to log what they do, then dumping the info into a database and doing the analysis from there. Or I could log to the event viewer and use the Log Viewer to get the info from there.
What do you think is the overall better approach?
I would recommend you use an existing, off-the-shelf solution for this, rather than building your own - it's the kind of project that very rapidly grows in size. You go from the 3 metrics in your question to "oh, and can you break that down by the country from which the user browses?", "what languages affect the questions?", "do they end up buying anything if they click results for bananas?". And then, before you know it, you've built your own web analytics tool...
So, you can either use "web analytics as service" offerings like Google Analytics, or use a more old-fashioned log-parsing solution. Most of the questions you want to answer can be derived from the data in the IIS web logs; there are numerous applications to parse that data, including open source and free solutions.
It's been a long while since I used a log file based analytics tool, but my ISP provides AWStats, which seems pretty good - to do what you want, you'll have to set up specific measurements around your search page; not sure if AWStats does that (Google Analytics definitely does); check the Wikipedia list for log file analysis tools which do that.
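As a rough illustration of what log parsing involves, the sketch below reads W3C extended format lines (the format IIS writes, with a `#Fields:` directive naming the columns) and counts searches and wildcard searches. The `/search` path, the `q` parameter, the `*` wildcard character and the sample lines are all made-up assumptions:

```python
from urllib.parse import parse_qs


def iter_w3c_rows(lines):
    """Yield one dict per request, using the '#Fields:' directive as the header."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            yield dict(zip(fields, line.split(" ")))


def count_wildcard_searches(lines, path="/search", param="q"):
    """Return (total searches, searches whose query contains '*')."""
    total = wildcard = 0
    for row in iter_w3c_rows(lines):
        if row.get("cs-uri-stem") != path:
            continue
        total += 1
        # parse_qs also decodes percent-encoded characters such as %2A.
        terms = parse_qs(row.get("cs-uri-query", "")).get(param, [])
        if any("*" in term for term in terms):
            wildcard += 1
    return total, wildcard


sample_log = [
    "#Fields: date time cs-method cs-uri-stem cs-uri-query sc-status",
    "2024-01-01 10:00:00 GET /search q=banana 200",
    "2024-01-01 10:00:05 GET /search q=ban* 200",
    "2024-01-01 10:00:09 GET /about - 200",
]
totals = count_wildcard_searches(sample_log)
```

An off-the-shelf log analyzer does the same kind of work with far more edge cases handled, which is the argument for not building it yourself.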
Obviously you need to log every submit of the search page.
In particular you need to log:
DateTime.Now
SearchString
SessionID
You could also store a counter in the Session that is incremented each time a user loads a page that is not the search page.
When the user performs a search, you could read that value from the session, store it in the database and reset the counter.
Be aware that the metric of "how many results does a user browse before a new search" should only be taken as an estimate and not as a real metric, due to cookie support, multitabbing, page reloads et cetera.
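The logging scheme above can be sketched independently of the web framework. Here an in-memory list stands in for the database table and a dict stands in for the per-session counter; the class and field names, and the use of `*` as the wildcard character, are assumptions:

```python
from datetime import datetime, timezone


class SearchLog:
    def __init__(self):
        self.rows = []      # stands in for a database table
        self._browsed = {}  # session_id -> pages viewed since the last search

    def page_viewed(self, session_id):
        """Call for every page load that is not the search page."""
        self._browsed[session_id] = self._browsed.get(session_id, 0) + 1

    def search_submitted(self, session_id, query):
        """Log one search and reset the session's browse counter."""
        self.rows.append({
            "time": datetime.now(timezone.utc),
            "session_id": session_id,
            "query": query,
            "has_wildcard": "*" in query,
            "browsed_before": self._browsed.pop(session_id, 0),
        })


def summarize(rows):
    """Derive the question's metrics from the logged rows."""
    per_session = {}
    for row in rows:
        per_session[row["session_id"]] = per_session.get(row["session_id"], 0) + 1
    return {
        "searches": len(rows),
        "wildcard_searches": sum(1 for row in rows if row["has_wildcard"]),
        "sessions_with_follow_up": sum(1 for n in per_session.values() if n > 1),
    }


log = SearchLog()
log.search_submitted("s1", "banana")
log.page_viewed("s1")
log.page_viewed("s1")
log.search_submitted("s1", "ban*")
log.search_submitted("s2", "apple")
summary = summarize(log.rows)
```

Because the counter is attached to the next search, `browsed_before` on a row describes browsing that followed the previous search in that session.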

Resources