What are some good online sources for data sets? - test-data

Some time ago I came across as site online who's sole purpose was the collection of various data sets, location data, district census data, or whatever sets community members were interested in maintaining.
My question is, do you know the site that I'm thinking of, or can you suggest any other sites that perform a similar service?
I'll suggest GeoNames, a great source for zip/postal, lat/long and lots of other geographical infomation.

Take a look at these websites:
IBM Many Eyes
Swivel
Data 360

Here are some others out of my bookmarks:
DatabaseAnswers.org
Discogs
Freebase

Another that was just shown to me is sig.ma, a search tool for retrieving ontologies. If you're building web 2.0 services, this would be a great tool for bootstrapping.

I use the Common Data Hub a lot:
http://www.commondatahub.com/home
especially for tables of standardized data.

buzzdata: http://buzzdata.com/ seems like a pretty good way to publish and share large data sets to me.

What about sports: The baseball archive has stats about everything baseball from 1871...
http://baseball1.com/statistics - great source of date if one wants to train some statistics.

The most comprehensive dataset for music is MusicBrainz http://musicbrainz.org/

Search
Edit: I don't mean search generally, I mean to say: Amazon is hosting a few selected public domain data sets for all to use. You can find out what data sets they have available by searching.

Related

How google scrape stackoverflow all questions while searching?

I am making search engine. But I want to know, How google scrapes all data of stackoverflow.
As my intuition,
Do they save all stackoverflow data in csv file?
and when user types some coding question, use some algorithm and recommend users.
or anything else,
Or Anything else?
Thank you for help.
Storing every data in a csv and running a search will probably cost you hours to retrieve a result.
Google Search Engine works in 3 stages, Crawling, Indexing, and Serving.
These algorithms working in these 3 stages are what makes Google so powerful. As they are fine tuned after years of optimization and learning, they can accurate index each webpage and analyze them without just plainly storing everything, just what you suggested.
Reference: How Search Works

Storing and downloading Data in iOS Applications

I am a bit new to iOS Development and I was wondering if someone could point me in the right direction regarding an application I am working on.
I am currently working on an application that will be displaying product lists and categories. The list is updated on a weekly basis (one every week).
I am now trying to decide two things:
1- What's the best method of storing this data, I am looking for a way that will allow me to replace the data in the application once every week.
2- Is it going to be beneficial to use CoreData? Note that I Only have Product Category, Product and Product Information entities.
Appreciate your support.
I would use Core Data. Because I know Core Data and am used to work with it. But this is clearly very much like using a chainsaw to cut a slice of bread.
As I understand, you're not familiar with Core Data. Maybe it's not the right tool for the job considering the learning curve.
In your case I would simply use JSON files as provided by the server.
That said, if your looking in Core Data anyway, any store will do, either atomic, XML or SQLite. The first two will load the whole data set in memory and queries will be done in memory as well. SQLite provides the benefits usually associated with databases, with a slightly increased complexity. A chainsaw.
I would use Core Data. If you haven't worked with Core Data before, learn it. It's a great framework.

data collection for statistics: from web to a database

I'm a statistician by trade and I'd like some recommendations on how to set up a website that can collect data into a database. For personal use, I use Google Forms to collect data, and everything gets populated into a spreadsheet. However, this may not be appropriate in a more professional setting, especially when we have multiple pages/forms. I imagine two uses:
A website where I can send the link to others so they can fill out, similar to Google Forms.
A website where only authorized users can log in to fill out data. Think of a setting where patients are followed periodically in a research study. It'd be cool to have the clinician enter the data directly into the database as he/she fills out the forms as opposed to having another data analyst transcribe his written forms into the database.
The obvious solution would be to hire a web developer. However, I like doing things myself when they are manageable. I imagine a web developer would have to know html, php, and database knowledge (eg, MySQL or PostgreSQL). My experience in these are limited to setting up a wordpress blog on my linux server. My experience with html is also limited as I use emacs org-mode to generate them from plain text. I hope to hear about solutions with a minimal learning curve. My preference of course would be free open source software and Linux-based, but I'd like to hear all available solutions (our data manager is a Windows user).
I recently read a post on Linux Journal that mentions REDCap, but it seems you have to get institutional permission to use.
I also tagged "R" on this post as I'd like to hear what R users are doing about data collection. I'll ultimately analyze the data with R, but all data analysis begins with the scientific question and data collection.
Thanks!
UPDATE 10/4/2010: Thanks everyone for the responses so far. It appears most of the third-party solutions proposed so far has data housed in a database hosted by the vendor. I'd like to house all data in our SQL Server. That is, data entry from the web enters the database in real time, ready for data analysis.
Maybe the limesurvey.org project is of interest ...
It sounds to me like you've got yourself a med study. There are a plethora of concerns that come to mind just from what you've described you want to do. Not the least of which is privacy. Where is it going to be hosted? Have you received consent from the patients to be collecting and transmitting their information electronically? What data are you storing, if any, that could combine to present their identity.
Personally, I steer clear of DIY online data collection tools. I pay a firm, like Ipsos, Research Now/E-Rewards, to program and manage data collection using questionnaires that I have designed. The reason is, knowing how to design research and analyze data is one thing. But if you've been trained in statistics - I can safely argue that you "don't know shit" about data collection. Sure you may know a bunch about sampling theory, but when it comes to getting data in - it's best to leave it to the pros.
There are a number of "industrial quality" online data collection tools available.
Confirmit (Pretty much the gold standard for online data collection)
DASH (Smaller following, but incredibly flexible)
There are also purely web based solutions, some of which are free (not that I would recommend using them)
QuestionPro
SurveyMonkey
Zoomerang
Although, unless you're doing a study with over 50 patients, I would just recommend getting the Physicians or their assistants to fill out Excel sheets and send them to your co.
Also, it's unlikely that you'll need to set-up a username/password system. What you want is referred to as an "open-link". Where respondents click a link and enter information, identifier info can be added by the respondent. You don't need a password because people can only INPUT information, not read it.
Most of the systems I mentioned above work on the idea of emailing a respondent (a clinician) with a link to a web based survey. Which could be easily adapted to your specific needs and act as a reminder to the clinician to fill out the form.
If your question types are simple. I'm sure you could hire a programmer to put together a website that has the forms you need behind an authorized front end. PHP/MySQL would likely do the trick. But, I would double check the privacy laws in your jurisdiction surrounding medical research before going ahead.
I have conducted medial research using an online form (actually two of them). My questions were quite discrete and particular to the disease I was researching.
Previously in a related project, I had created two or three page questionnaires which were printed and then subjects and surgeons filled out the forms and our research coordinator would enter them into our database. It was a lot of work with lots of room for error. I did not like it. Online forms were much better.
I used SurveyGizmo and was happy with it. I looked at lots of options about two years ago. Google Forms did not exist at that time. I went with SurveryGizmo primarily because they had a a statement (attestation) that they were compliant with HIPAA. I could not ensure security such as ssl connections with the other websites. However in order to get myself into that capability (https connections) I had to buy the enterprise level eventhough on every other capability I could have used the free service. Also SurveyGizmo offered a 50% reduction for non-profits which our research institute qualified for.
SurveryGizmo was easy to design and put into production without having to program myself. It was easy to download the data in csv format and read that straight into R. Although I had some weird issues that I needed help with. I had to use the "old" format for export so that it came as a straigtforward csv. Furthermore, the csv file had the odd feature of the the first TWO rows being header rows. But I solved that problem with the help of stackoverflow.
SurveryGizmo has fantastic logic and piping that enbabled me to only ask relevant questions and thereby not waste the time of my respondents and even more importantly, there were no irrelevant questions to confuse respondents.
Finally, I was able to use SurveyGizmo in such a way that I could also track our (research staff) fulfillment and logistics. For instance we got notification when there were new potential subjects who were interested in participating. We were able to note FedEx tracking numbers along with the records of the corresponding subjects.
Basically it worked well.
The safest platform for collecting confidential survey data is Confirmit. There is a learning curve involved here- you will be coding in VisualSQL, which is only used in Confirmit. The survey responses will export to csv files, where you can analyze your results in R.
If you are collecting any confidential data, or data where respondents need unique access links so they can only see their own version of the survey, you will want to use Confirmit. The data is housed in Confirmit's data center, but their data is much more secure than other vendors (i.e., a third party will not be able to hack into your survey and see an individual's responses, or intercept the data that is being sent from your respondent to Confirmit).

how to find what isbns are in use

I am trying to find a list of what ISBNs are in use. I guess I could scrape a website like Amazon but that would waste a lot of bandwidth. Is there a better (free) way?
Maybe you could use the remote API for isbndb.com.
Trying to keep an enormous ISBN list up-to-date yourself is quite a huge task if you ask me.
Just for the record: note that if you actually want an ISBN for your publication, you need to go to the official agency in your country. In the US this is http://www.isbn.org/ , but it varies by country. In Australia, for example, it is here.
This might help: What is the most complete (free) ISBN API?
As the accepted answer states there is also an API to search Amazon but it's not actually supposed to be used in the way you wish to.
ended up using partial list from http://my.linkbaton.com/isbn/
Yes, try isbndb.com

Sample Data Creation Tool (mainly for Databases)

I’m thinking through some database design concepts and believe that creating sample data simulating real-world volume of my application will help solidify some design decisions.
Does any anyone know of a tool to create sample data? I’m looking for something that’s database and platform neutral if possible (from MySQL to DB/2 and Windows to UNIX) so to test the design across different systems/architectures. I’m visioning some tool that you can:
point to a database table(s) (some configuration of the DSN, etc.)
introspect the fields and based on the field... (point-and-click or add some configuration)
have a means for expressing how to create sample data (MySQL Sample Data Creator is the kind of thing I vision but I think their'd be some more options like commit frequency so to create very large data sets... millions or billions of rows... don't think this tool would scale to the volume of data I want to create)
push a button and go (depending on your parameters, this may take a long time)
Any thoughts? Sure, I could write an app to do this but it seems so generic that I shouldn’t have to reinvent the wheel.
DBMonster is fine but I prefer databene benerator as I explained it in this answer to a similar question.
Something like DBMonster?
This page also has a listing of many DB data generators.
I cannot help you with MySQL or DB/2 but, in case anyone gets to this answer with a need for MS SQL Server, I can recommend the Data Generator from Red Gate.
Our test data generator, Datanamic DB Data Generator can do this for you. Works with MySQL. It uses default "generator settings" when loading your tables the first time. You can then "fine-tune" the fields and/or choose other "generators".

Resources