How google scrape stackoverflow all questions while searching? - web-scraping

I am making search engine. But I want to know, How google scrapes all data of stackoverflow.
As my intuition,
Do they save all stackoverflow data in csv file?
and when user types some coding question, use some algorithm and recommend users.
or anything else,
Or Anything else?
Thank you for help.

Storing every data in a csv and running a search will probably cost you hours to retrieve a result.
Google Search Engine works in 3 stages, Crawling, Indexing, and Serving.
These algorithms working in these 3 stages are what makes Google so powerful. As they are fine tuned after years of optimization and learning, they can accurate index each webpage and analyze them without just plainly storing everything, just what you suggested.
Reference: How Search Works

Related

How to search the internet for pages containing specified terms and storing the results in a data table, from within R, using OpenSearch

I am setting up a database of certain events that have occurred in the past, and need to search the internet for a number of terms to retrieve as many pages as possible that contain terms related to the happenings i want to document.
First I looked into achieving this using Googles "Custom Search API", after reading this question:
Need to access Google Custom search api through R
I did manage to get a JSON of search results through the browser, but not through R, so I moved on.
When I saw that the Custom Search API was using OpenSearch, and found the rOpenSearch package for R, I wanted to try going down this path:
http://terradue.github.io/rOpenSearch/
After reading through the documentation, there was only provided examples of searching sites that provide opensearch descriptions. As I need to search as many websites as possible, it seems like I would need an opensearch description for a search engine like Google. But I can't seem to find that anywhere.
Is there any way to search the internet via. R using OpenSearch, and collecting the results in a data table?
If you know of a better solution to my problem, I'd appreciate if you could point me in another direction.
If I read well, you are looking for something called Web Scraping via R.
<See me!>

Lotus Smartsuite to "something newer"

I shall try and keep my scenario as brief as possible and to the point.
The office I’m currently working for uses Lotus Smartsuite on Windows 98 / XP, using lots of Lotus Script to tie together Lotus 123 and Lotus Word Pro documents. They also make heavy use of the Lotus Object Linking functions. I shall describe its behaviour below:
You can fill rows and columns in a 123 Spreadsheet with data galore, style it and format it any way you like and define it as a range (nothing unique here). However, you can then copy that range and paste it as a link in a Lotus Word Pro document. This link is then categorised by its range name, so expanding the range back in the 123 file causes the table in the Word Pro Document to expand. This link also carries with it all the formatting and styling of the cells in the 123 Spreadsheet. As I imagine you are now aware, this link is completely live, you can double click anywhere in the object and it opens up the 123 file for editing, and all changes go backward and forward between the two documents. Most of the data retrieved from testing equipment is stored in these 123 spreadsheets and then parts of that are linked into a final Lotus Word Pro report document sent to the customer.
Note: Just to be clear, this is NOT the same as a DDE link in Open Office, which seems to allow for copying of a non-defined range of cells to be imported into a document where all formatting is lost and editing back and forth is not straight forward. It also behaves differently to an OLE object, which seems to only import the entire Spreadsheet rather than a small subsection of it.
However, in recent years, support this older software (Lotus) is becoming more difficult, especially with regards to sending customers documents (Lotus word Pro files are generally unsupported by more modern Office Tools) and technical support for Lotus Smartsuite seems to be practically non-existent these days. Also, with the fear of on going development in a scripting language no-longer being practised by mainstream IT technicians, on-going development and support seems futile. Once the guys who wrote it move on to other things, we will be left with spaghetti script in a language nobody can help us with.
So, we have this goal of "modernising" our IT system by the end of the year. Linux is becoming a very viable option too (No doubt Debian or a derivative), but Open Office doesn't seem to have the linking capability mentioned above. The reason this linking is so important is because the veterans of the office are so used to working this way - storing data in the spreadsheet, linking back to it later in their Word Pro documents, etc. I think they are more than keen to keep this practice going and we have found no equivalent of it in modern office tools (as was requested of me). I can see, as a software engineer (fluent in many languages), how this practice is not the safest or best way of using and storing data (databases spring to mind), but I was wondering if someone could give me a few other good reasons as to why this is bad practice in the work place (I was always in the belief that you should keep your data away from your reporting and formatting, the two should never be entwined - this looks like spreadsheet hell to me) ... or why this is a good thing to keep doing!?
So, for those of you still with me, I guess what I am asking is:
Is this practice of storing data, formatting it in spreadsheets and importing that directly back and forth between word documents good or bad, and what can be done about it? I guess I'll need to prove my point in case either way for this.
Are there ANY modern alternatives to this linking method (regardless of weather it is good or bad practice or not) out there for Linux or Windows? This link MUST carry formatting as well as dynamic range sizes (DDE links don't seem to be the answer).
What would your solution be if you had to start from scratch? Store everything in databases and use SQL to simply ask for the data you need in your word documents? How would you do this? What software would you use?
Any help with this scenario would be more than helpful, or if you know anywhere I should go to ask for advice, that would be appreciated too.
Thank-you for reading!
My suggestion is to first take a step back. What is the benefit to the way things are done now? Is it just a habit that is tough to break? Is there any reason the documents and spreadsheets need to be maintained and linked the way they are, or is it just a requirement because 'that's how it was done before'?
If you can remove that requirement, you have a lot more options and you're building a system that's easier to understand and maintain.
Regarding question 1, I believe there's nothing wrong with storing data in spreadsheets, especially if the end-users need to create and maintain them and development staff is limited. Some questions are whether that data needs to be secured, is related between spreadsheets, is duplicated across the company, or should be shared in a better way across the company. If any of those are true then a centralized database would make more sense. Personally I'd want any valuable data safely stored in a database where it can be managed, access to it can be controlled, it can be easily backed-up, etc.
Regarding question 2, you can do the same thing in Microsoft Office. You can either link the documents, so that the data stays in the source excel doc but appears in the word doc, or you can embed the excel spreadsheet within the word doc.
You might want to look at Microsoft Access for storing the data and generating reports. Or you could build an application using a relational database back-end and reporting front-end. The possibilities are wide-open. It really depends on where the expertise lies within the company.
If it were me I'd probably use a SQL Express back-end (it's free) and a custom ASP.NET MVC application for generating the reports, but that's just where my expertise lies.

data collection for statistics: from web to a database

I'm a statistician by trade and I'd like some recommendations on how to set up a website that can collect data into a database. For personal use, I use Google Forms to collect data, and everything gets populated into a spreadsheet. However, this may not be appropriate in a more professional setting, especially when we have multiple pages/forms. I imagine two uses:
A website where I can send the link to others so they can fill out, similar to Google Forms.
A website where only authorized users can log in to fill out data. Think of a setting where patients are followed periodically in a research study. It'd be cool to have the clinician enter the data directly into the database as he/she fills out the forms as opposed to having another data analyst transcribe his written forms into the database.
The obvious solution would be to hire a web developer. However, I like doing things myself when they are manageable. I imagine a web developer would have to know html, php, and database knowledge (eg, MySQL or PostgreSQL). My experience in these are limited to setting up a wordpress blog on my linux server. My experience with html is also limited as I use emacs org-mode to generate them from plain text. I hope to hear about solutions with a minimal learning curve. My preference of course would be free open source software and Linux-based, but I'd like to hear all available solutions (our data manager is a Windows user).
I recently read a post on Linux Journal that mentions REDCap, but it seems you have to get institutional permission to use.
I also tagged "R" on this post as I'd like to hear what R users are doing about data collection. I'll ultimately analyze the data with R, but all data analysis begins with the scientific question and data collection.
Thanks!
UPDATE 10/4/2010: Thanks everyone for the responses so far. It appears most of the third-party solutions proposed so far has data housed in a database hosted by the vendor. I'd like to house all data in our SQL Server. That is, data entry from the web enters the database in real time, ready for data analysis.
Maybe the limesurvey.org project is of interest ...
It sounds to me like you've got yourself a med study. There are a plethora of concerns that come to mind just from what you've described you want to do. Not the least of which is privacy. Where is it going to be hosted? Have you received consent from the patients to be collecting and transmitting their information electronically? What data are you storing, if any, that could combine to present their identity.
Personally, I steer clear of DIY online data collection tools. I pay a firm, like Ipsos, Research Now/E-Rewards, to program and manage data collection using questionnaires that I have designed. The reason is, knowing how to design research and analyze data is one thing. But if you've been trained in statistics - I can safely argue that you "don't know shit" about data collection. Sure you may know a bunch about sampling theory, but when it comes to getting data in - it's best to leave it to the pros.
There are a number of "industrial quality" online data collection tools available.
Confirmit (Pretty much the gold standard for online data collection)
DASH (Smaller following, but incredibly flexible)
There are also purely web based solutions, some of which are free (not that I would recommend using them)
QuestionPro
SurveyMonkey
Zoomerang
Although, unless you're doing a study with over 50 patients, I would just recommend getting the Physicians or their assistants to fill out Excel sheets and send them to your co.
Also, it's unlikely that you'll need to set-up a username/password system. What you want is referred to as an "open-link". Where respondents click a link and enter information, identifier info can be added by the respondent. You don't need a password because people can only INPUT information, not read it.
Most of the systems I mentioned above work on the idea of emailing a respondent (a clinician) with a link to a web based survey. Which could be easily adapted to your specific needs and act as a reminder to the clinician to fill out the form.
If your question types are simple. I'm sure you could hire a programmer to put together a website that has the forms you need behind an authorized front end. PHP/MySQL would likely do the trick. But, I would double check the privacy laws in your jurisdiction surrounding medical research before going ahead.
I have conducted medial research using an online form (actually two of them). My questions were quite discrete and particular to the disease I was researching.
Previously in a related project, I had created two or three page questionnaires which were printed and then subjects and surgeons filled out the forms and our research coordinator would enter them into our database. It was a lot of work with lots of room for error. I did not like it. Online forms were much better.
I used SurveyGizmo and was happy with it. I looked at lots of options about two years ago. Google Forms did not exist at that time. I went with SurveryGizmo primarily because they had a a statement (attestation) that they were compliant with HIPAA. I could not ensure security such as ssl connections with the other websites. However in order to get myself into that capability (https connections) I had to buy the enterprise level eventhough on every other capability I could have used the free service. Also SurveyGizmo offered a 50% reduction for non-profits which our research institute qualified for.
SurveryGizmo was easy to design and put into production without having to program myself. It was easy to download the data in csv format and read that straight into R. Although I had some weird issues that I needed help with. I had to use the "old" format for export so that it came as a straigtforward csv. Furthermore, the csv file had the odd feature of the the first TWO rows being header rows. But I solved that problem with the help of stackoverflow.
SurveryGizmo has fantastic logic and piping that enbabled me to only ask relevant questions and thereby not waste the time of my respondents and even more importantly, there were no irrelevant questions to confuse respondents.
Finally, I was able to use SurveyGizmo in such a way that I could also track our (research staff) fulfillment and logistics. For instance we got notification when there were new potential subjects who were interested in participating. We were able to note FedEx tracking numbers along with the records of the corresponding subjects.
Basically it worked well.
The safest platform for collecting confidential survey data is Confirmit. There is a learning curve involved here- you will be coding in VisualSQL, which is only used in Confirmit. The survey responses will export to csv files, where you can analyze your results in R.
If you are collecting any confidential data, or data where respondents need unique access links so they can only see their own version of the survey, you will want to use Confirmit. The data is housed in Confirmit's data center, but their data is much more secure than other vendors (i.e., a third party will not be able to hack into your survey and see an individual's responses, or intercept the data that is being sent from your respondent to Confirmit).

How to scrape websites such as Hype Machine?

I'm curious about website scraping (i.e. how it's done etc..), specifically that I'd like to write a script to perform the task for the site Hype Machine.
I'm actually a Software Engineering Undergraduate (4th year) however we don't really cover any web programming so my understanding of Javascript/RESTFul API/All things Web are pretty limited as we're mainly focused around theory and client side applications.
Any help or directions greatly appreciated.
The first thing to look for is whether the site already offers some sort of structured data, or if you need to parse through the HTML yourself. Looks like there is an RSS feed of latest songs. If that's what you're looking for, it would be good to start there.
You can use a scripting language to download the feed and parse it. I use python, but you could pick a different scripting language if you like. Here's some docs on how you might download a url in python and parse XML in python.
Another thing to be conscious of when you write a program that downloads a site or RSS feed is how often your scraping script runs. If you have it run constantly so that you'll get the new data the second it becomes available, you'll put a lot of load on the site, and there's a good chance they'll block you. Try not to run your script more often than you need to.
You may want to check the following books:
"Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL"
http://www.amazon.com/Webbots-Spiders-Screen-Scrapers-Developing/dp/1593271204
"HTTP Programming Recipes for C# Bots"
http://www.amazon.com/HTTP-Programming-Recipes-C-Bots/dp/0977320677
"HTTP Programming Recipes for Java Bots"
http://www.amazon.com/HTTP-Programming-Recipes-Java-Bots/dp/0977320669
I believe that the most important thing you must analyze is which kind of information do you want to extract. If you want to extract entire websites like google does probably your best option is to analyze tools like nutch from Apache.org or flaptor solution http://ww.hounder.org If you need to extract particular areas on unstructured data documents - websites, docs, pdf - probably you can extend nutch plugins to fit particular needs. nutch.apache.org
On the other hand if you need to extract particular text or clipping areas of a website where you set rules using DOM of the page probably what you need to check is more related to tools like mozenda.com. with those tools you will be able to set up extraction rules in order to scrap particular information on a website. You must take into consideration that any change on a webpage will give you an error on your robot.
Finally, If you are planning to develop a website using information sources you could purchase information from companies such as spinn3r.com were they sell particular niches of information ready to be consume. You will be able to save lots of money on infrastructure.
hope it helps!.
sebastian.
Python has the feedparser module, located at feedparser.org that actually handles RSS in its various flavours and ATOM in its various flavours. No reason to reinvent the wheel.

Recommended method to download tweets based on search terms and store

I would like to download tweets based on certain search terms. I'm aware of HTTP GET and such techniques, but I'm not sure the best way to create a simple executable that downloads the tweets and saves them for subsequent analysis.
Any ideas? I'm a basic programmer - if you say "use curl" I know roughly what you mean but not how to set up an application to run curl commands!
Hence my dilemna.
Thanks in advance!
You absolutely can do it in c# or any other language.
From a very rudimentary standpoint, the Twitter API wiki will tell you how, but I know that's not what you're really asking.
I would suggest getting familiar with a good API such as Tweetsharp which also has methods not only for getting your typical timelines, but also using search. The advantage to this (aside from not having to handle your own serialization, etc.) is that it unifies the timeline and search calls as they are actually slightly different API's.
The downside to this approach though is that you're not going to be able to directly translate it to a mac, unless you write it using Silverlight.
the upside to this approach is that Tweetsharp gives you a number of options on how it gives you the data, which in turn gives you a number of options as to how to save the data.

Resources