Looking for big, complex sample data - bigdata

I want to benchmark some (graph) databases and am looking for big, complex datasets. The dataset should be between 2 TB and 5 TB in size. Do you know of any sample datasets (maybe open government or science data) which fulfill these criteria?

These should fit your requirements:
The 1000 Genomes project makes 260 TB of human genome data available
The Internet Archive is making an 80 TB web crawl available for research
The TREC conference made the ClueWeb09 dataset available a few years back. You'll have to sign an agreement and pay a nontrivial fee (up to $610) to cover the sneakernet data transfer. The data is about 5 TB compressed.
ClueWeb12 is now available, as are the Freebase annotations, FACC1
CNetS at Indiana University makes a 2.5 TB click dataset available
ICWSM made a large corpus of blog posts available for their 2011 conference. You'll have to register (an actual form, not an online form), but it's free. It's about 2.1 TB compressed.
The Proteome Commons makes several large datasets available. The largest, the Personal Genome Project, is 1.1 TB in size.
There are several others over 100 GB in size.

Read only rows meeting a condition from a compressed RData file in a Shiny App?

I am trying to make a Shiny app that can be hosted for free on shinyapps.io. Free hosting requires that all uploaded data/code be <1 GB, and that the app use <1 GB of memory at any time while running.
The data
The underlying data (that I'm uploading) is 1000 iterations of a network with ~3050 nodes. Each interaction between nodes (~415,000 interactions per network) has 9 characteristics--of the origin, the destination, and the interaction itself--that I need to keep track of. The app needs to read in data from all 1000 networks for user-selected node(s) meeting user-specified criteria (those 9 characteristics) and summarize it (in a map & table). I can use 1000 one-per-network RData files (more on format below) and the app works, but it takes ~10 minutes to load, and I'd like to speed that up.
A couple notes about what I've done/tried, but I'm not tied to any of this if you have better ideas.
The data is too large to store as CSVs (and fall under the 1 GB upload limit), so I've been saving it as RData files of a data.frame with "xz" compression.
To further reduce size, I've turned the data into frequency tables of the 9 variables of interest (see the sketch after this list).
In a desktop version, I created 10 summary files that each contained the data for 100 networks (~5 minutes to load), but these are too large to be read into memory in a free shiny app.
I tried making RData files for each node (instead of splitting by network), but they're too large for the 1 GB upload limit.
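For concreteness, here's a rough sketch of the storage scheme I described above; the object, column, and file names are just placeholders, and the toy table stands in for one network's ~415,000 interactions:
# Toy stand-in for one network's interaction table (column names are made up)
set.seed(1)
one_network <- data.frame(
  origin = sample(1:3050, 1000, replace = TRUE),
  dest   = sample(1:3050, 1000, replace = TRUE),
  type   = sample(c("a", "b", "c"), 1000, replace = TRUE)
)

# Collapse to a frequency table of the characteristics the app actually needs
freq <- aggregate(count ~ origin + dest + type,
                  data = transform(one_network, count = 1), FUN = sum)

# One file per network, with "xz" compression (smallest files, slower reads)
save(freq, file = "network_0001.RData", compress = "xz")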
I'm not sure there are better ways to package the data (but again, happy to hear ideas!), so I'm looking to optimize processing it.
Finally, a question
Is there a way to read only certain rows from a compressed RData file, based on some value (e.g., nodeID)? This post (quickly load a subset of rows from data.frame saved with `saveRDS()`) makes me think that might not be possible because the file is compressed. In looking at other options, awk keeps coming up, but I'm not sure whether that would work with an RData file (I only seem to see data.frame/data.table/CSV implementations).

Textfiles for Information Retrieval

I am searching for sample .txt files for information retrieval.
It would be nice if there were sets of documents (around 20 documents) on one topic, e.g., sports, music, etc.
Thanks
There are many datasets available, for instance:
Datasets used to evaluate IR systems:
http://www.daviddlewis.com/resources/testcollections/
More IR datasets:
http://boston.lti.cs.cmu.edu/callan/Data/
A comprehensive list of several datasets:
http://zitnik.si/mediawiki/index.php?title=Datasets
The classic news groups dataset: http://scikit-learn.org/stable/datasets/twenty_newsgroups.html
Much bigger, news articles: http://research.signalmedia.co/newsir16/signal-dataset.html

How to load 35 GB data into R?

I have a data set with 20 million records and 50 columns. Now I want to load this data set into R. My machine has 8 GB of RAM and my data set is 35 GB. I have to run my R code on the complete data. So far I have tried the data.table (fread) and bigmemory (read.big.matrix) packages to read the data, but without success. Is it possible to load 35 GB of data into my 8 GB machine?
If so, please guide me on how to overcome this issue.
Thanks in advance.
By buying more RAM. Even if you manage to load all your data (seems like it's textual), there won't be any room left in memory to do whatever you want to do with the data afterwards.
It's actually probably the only correct answer if you HAVE TO load everything into RAM at once. You probably don't have to, but even so, buying more RAM might be easier.
Look at the cloud computing options, like Azure or AWS or Google Compute Engine.
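For what it's worth, here's a rough sketch of the file-backed route via bigmemory (which the question mentions trying); the file names are made up and it assumes all 50 columns are numeric:
library(bigmemory)

# Parse the CSV once into a file-backed big.matrix; the data lives on disk,
# not in RAM, so the 35 GB never has to fit into the 8 GB of memory.
x <- read.big.matrix("big_data.csv", header = TRUE, type = "double",
                     backingfile = "big_data.bin",
                     descriptorfile = "big_data.desc")

# Later sessions can re-attach the backing file without re-reading the CSV
x <- attach.big.matrix("big_data.desc")
dim(x)   # 20 million rows, 50 columns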

R, RAM amounts, and specific limitations to avoid memory errors

I have read about various big data packages for R. Many seem workable, except that, at least as I understand the issue, many of the packages I like to use for common models are not available in conjunction with the recommended big data packages (for instance, I use lme4, VGAM, and other fairly common regression analysis packages that don't seem to play well with big data packages like ff, etc.).
I recently attempted to use VGAM to fit polytomous models using data from the General Social Survey. When I set some models running that accounted for the clustering of respondents in years, as well as a list of other controls, I started hitting the whole "cannot allocate vector of size yadda yadda..." error. I've tried various recommended fixes, such as clearing out memory and using matrices where possible, to no good effect. I am inclined to increase the RAM on my machine (actually just buy a new machine with more RAM), but I want a good idea of whether that will solve my woes before letting go of $1500 on a new machine, particularly since this is for my personal use and will be funded solely by me on my grad student budget.
Currently I am running a Windows 8 machine with 16 GB RAM and R 3.0.2, and all packages I use have been updated to their most recent versions. The data sets I typically work with max out at under 100,000 individual cases/respondents. As far as analyses go, I may need matrices and/or data frames with many rows if, for example, I use 15 variables with interactions between factors that have several levels, or if I need multiple rows in a matrix for each of my 100,000 cases based on reshaping to one row per category of some DV per respondent. That may be a touch large for some social science work, but I feel like in the grand scheme of things my requirements are actually not all that hefty as far as data analysis goes. I'm sure many R users do far more intense analyses on much bigger data.
So, I guess my question is this: given the data size and types of analyses I'm typically working with, what would be a comfortable amount of RAM to avoid memory errors and/or having to use special packages to handle the size of the data/processes I'm running? For instance, I'm eye-balling a machine that sports 32 GB RAM. Will that cut it? Should I go for 64 GB RAM? Or do I really need to bite the bullet, so to speak, and start learning to use R with big data packages, or maybe just find a different stats package or learn a more intense programming language (not even sure what that would be; Python? C++?). The latter option would be nice in the long run, of course, but would be rather prohibitive for me at the moment. I'm mid-stream on a couple of projects where I am hitting similar issues and don't have time to build new language skills altogether under deadlines.
To be as specific as possible - What is the max capability of 64 bit R on a good machine with 16GB, 32GB, and 64GB RAM? I searched around and didn't find clear answers that I could use to gauge my personal needs at this time.
A general rule of thumb is that R roughly needs three times the dataset size in RAM to be able to work comfortably. This is caused by the copying of objects in R. So, divide your RAM size by three to get a rough estimate of your maximum dataset size. Then you can look at the type of data you use, and choose how much RAM you need.
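As a quick worked example of that rule of thumb, using the numbers from the question and assuming numeric (8-byte) columns:
ram_gb <- 16
ram_gb / 3                 # ~5.3 GB: rough maximum comfortable dataset size
100000 * 15 * 8 / 1024^3   # ~0.011 GB: 100,000 rows of 15 numeric columns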
Of course, R can also process data out-of-memory, see the HPC task view. This earlier answer of mine might also be of interest.

Datasets for Running Statistical Analysis on [closed]

What datasets exist out on the internet that I can run statistical analysis on?
The datasets package is included with base R. Run this command to see a full list:
library(help="datasets")
Beyond that, there are many packages that can pull data, and many others that contain important data. Of these, you may want to start by looking at the HistData package, which "provides a collection of small data sets that are interesting and important in the history of statistics and data visualization".
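For example, once HistData is installed you can browse and load its contents the same way as the base datasets (Galton, used below, is one of the historical tables I believe it ships with):
install.packages("HistData")         # if not already installed
data(package = "HistData")           # list the bundled historical datasets
data(Galton, package = "HistData")   # load Galton's 1886 parent/child height data
head(Galton)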
For financial data, the quantmod package provides a common interface for pulling time series data from google, yahoo, FRED, and others:
library(quantmod)
getSymbols("YHOO",src="google") # from google finance
getSymbols("GOOG",src="yahoo") # from yahoo finance
getSymbols("DEXUSJP",src="FRED") # FX rates from FRED
FRED (the Federal Reserve Bank of St. Louis) is really a goldmine of free economic data.
Many R packages come bundled with data that is specific to their goal. So if you're interested in genetics, multilevel models, etc., the relevant packages will frequently have the canonical example for that analysis. Also, the book packages typically ship with the data needed to reproduce all the examples.
Here are some examples of relevant packages:
alr3: includes data to accompany Applied Linear Regression (http://www.stat.umn.edu/alr)
arm: includes some of the data from Gelman's "Data Analysis Using Regression and Multilevel/Hierarchical Models" (the rest of the data and code is on the book's website)
BaM: includes data from "Bayesian Methods: A Social and Behavioral Sciences Approach"
BayesDA: includes data from Gelman's "Bayesian Data Analysis"
cat: includes data for analysis of categorical-variable datasets
cimis: for retrieving data from CIMIS, the California Irrigation Management Information System
cshapes: includes GIS boundaries and associated data
ecdat: data sets for econometrics
ElemStatLearn: includes data from "The Elements of Statistical Learning, Data Mining, Inference, and Prediction"
emdbook: data from "Ecological Models and Data"
Fahrmeir: data from the book "Multivariate Statistical Modelling Based on Generalized Linear Models"
fEcoFin: "Economic and Financial Data Sets" for Rmetrics
fds: functional data sets
fma: data sets from "Forecasting: methods and applications"
gamair: data for "Generalized Additive Models: An Introduction with R"
geomapdata: data for topographic and geologic mapping
nutshell: contains all the data from the "R in a Nutshell" book
nytR: provides access to congressional vote data through the NY Times API
openintro: data from the book
primer: includes data for "A Primer of Ecology with R"
qtlbook: includes data for the R/qtl book
RGraphics: includes data from the "R Graphics" book
Read.isi: access to old World Fertility Survey data
There's a broad selection on the Web. For instance, here's a massive directory of sports databases (all providing the data free of charge, at least in my experience). In that directory is databaseBaseball.com, which contains, among other things, complete datasets for every player who has ever played professional baseball since about 1915.
StatLib is another excellent resource--beautifully convenient. This single web page lists 4-5 line summaries of over a hundred databases, all of which are available in flat-file form just by clicking the 'Table' link at the beginning of each data set summary.
The base distribution of R comes pre-packaged with a large and varied collection of datasets (122 in R 2.10). To get a list of them (as well as a one-line description):
data(package="datasets")
Likewise, most packages come with several data sets (sometimes a lot more). You can see those the same way:
data(package="latticeExtra")
data(package="vcd")
These data sets are the ones mentioned in the package manuals and vignettes for a given package, and used to illustrate the package features.
A few R packages with a lot of datasets (which again are easy to scan so you can choose what's interesting to you): AER, DAAG, and vcd.
Another thing I find so impressive about R is its I/O. Suppose you want to get some very specific financial data via the Yahoo Finance API, say the open and closing prices of the S&P 500 for every month from 2001 to 2009. Just do this:
tick_data <- read.csv(paste0("http://ichart.finance.yahoo.com/table.csv?",
                             "s=%5EGSPC&a=03&b=1&c=2001&d=03&e=1&f=2009&g=m&ignore=.csv"))
In this one line of code, R has fetched the tick data, shaped it into a data frame, and bound it to 'tick_data', all in one step. (Here's a handy cheat sheet with the Yahoo Finance API symbols used to build URLs like the one above.)
http://www.data.gov.uk/data
Recently set up by Tim Berners-Lee.
Obviously UK-based data, but that shouldn't matter. It covers everything from abandoned cars to school absenteeism to agricultural price indexes.
Have you considered Stack Overflow Data Dumps?
You are already familiar with what the data represents, i.e., the business logic it tracks.
Good starting points for economic data are the following three addresses:
World Bank - Research Datasets
IMF - Data and Statistics
National Bureau of Economic Research
A nice summary of dataset links for development economists can be found at:
Devecondata
Edit:
The World Bank decided last week to open up a lot of its previously non-free datasets and published them online on its revised homepage. The redesigned site looks pretty nice as well.
The World Bank - Open Data
Another good site is UN Data.
The United Nations Statistics Division (UNSD) of the Department of Economic and Social Affairs (DESA) launched a new internet based data service for the global user community. It brings UN statistical databases within easy reach of users through a single entry point (http://data.un.org/). Users can now search and download a variety of statistical resources of the UN system.
http://www.data.gov/ probably has something you can use.
In their catalog of raw data you can set your criteria and find what you're looking for: http://www.data.gov/catalog/raw
A bundle of 268 small text files (the worked examples of "The R Book") can be found in The R Book's companion website.
You could look at this post on FlowingData.
A collection of over 800 datasets in the ARFF format understood by Weka and other data analysis packages, gathered in the TunedIT.org repository.
See the data competition set up by Hadley Wickham for the Data Expo of the ASA Statistical Computing and Statistical Graphics section. The competition is over, but the data is still there.
The UC Irvine Machine Learning Repository currently has 190 data sets.
The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
I've seen from your other questions that you are apparently interested in data visualization. Then have a look at the Many Eyes project (from IBM) and its sample data sets.
Similar to data.gov, but European-centered, is Eurostat:
http://epp.eurostat.ec.europa.eu/portal/page/portal/statistics/search_database
and there is a Chinese statistics department, too, as mentioned by Wildebeests:
http://www.stats.gov.cn/english/statisticaldata/monthlydata/index.htm
Then there are some "social data services" which offer dataset downloads, such as Swivel, Many Eyes, Timetric, CKAN, and Infochimps.
The FAO offers the AQUASTAT database, with various water-related indicators broken down by country.
The Naval Oceanography Portal offers, for instance, Fraction of the Moon Illuminated.
The blog "curving normality" has a list of interesting data sources.
Another collection of datasets.
Here's an R package with several agricultural datasets from books and papers. Example analyses included: agridat
