I am struggling a bit with an analysis I need to do. I have collected data consisting of little owl calls that were recorded along transects. I want to analyse these recordings for similarity, in order to see which recorded calls are from the same owls and which are from different owls. In that way I can make an estimate of the size of the population at my study area.
I have done a bit of research and the package warbleR seems suitable for this. However, I am far from an R expert and am struggling a bit with how to go about it. Do any of you have experience with these types of analyses, and maybe example scripts? It seems to me that I could use the function cross_correlation and maybe run a PCA; however, the warbleR vignette I looked at only does this for different call types, not for the same call type from different individuals, so I am not sure if it would work.
To be able to run analyses with warbleR you need to input the data using the "selection_table" format. Take a look at the example data "lbh_selec_table" to get a sense of the format:
library(warbleR)
data(lbh_selec_table)
head(lbh_selec_table)
The whole point of these objects is to tell R the time location in your sound files (in seconds) of the signals you want to analyze. Take a look at this link for more details on this object structure and how to import it into R.
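To give a concrete sense of the workflow, here is a sketch only — the file names, start/end times, and tuning parameters below are all hypothetical, and you will likely need to adjust the spectrogram settings for owl calls:

```r
library(warbleR)

# Hypothetical selections: start/end times (in seconds) of owl calls
# inside .wav files sitting in the working directory
owl_calls <- data.frame(
  sound.files = c("transect1.wav", "transect1.wav", "transect2.wav"),
  selec = c(1, 2, 1),
  start = c(1.2, 10.5, 3.0),
  end   = c(2.0, 11.3, 3.8)
)

# Turn the data frame into a selection table (this also checks that the
# selections match the sound files on disk)
st <- selection_table(X = owl_calls)

# Pairwise spectrographic cross-correlation; returns a similarity matrix
xc <- cross_correlation(st, wl = 300, ovlp = 90)

# Ordinate the calls so that similar ones plot close together; tight
# clusters may correspond to repeated calls from the same individual
mds <- stats::cmdscale(1 - xc, k = 2)
plot(mds, xlab = "Dim 1", ylab = "Dim 2")
```

This mirrors the vignette's approach (cross-correlation followed by an ordination); the fact that the vignette compares call types rather than individuals doesn't change the mechanics, only the interpretation of the clusters.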
I have the results for spatial clustering; in these results I have the ids of some cities in the USA. I would like to show these clustering results on a nice map. Is this feasible in R?
Yes, this is feasible.
You need to map the city ids to geographical data, then visualize it.
With the extensive drawing capabilities of R, this is not very hard; there are several R packages that will do the heavy lifting, and tutorials to guide you. Just pick whatever package you prefer.
We cannot give you a complete solution, of course, because we don't know what kind of ids you have. For example, many people use zip codes, others use FIPS ids, etc.
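For instance, assuming your ids can be matched to city names, here is a minimal sketch with the maps package (the column names and cities are made up):

```r
library(maps)

# Hypothetical clustering output: one row per city, with a cluster label
clusters <- data.frame(city    = c("New York NY", "Los Angeles CA", "Chicago IL"),
                       cluster = c(1, 2, 1))

# us.cities (bundled with the maps package) supplies lat/long per city name
data(us.cities)
plotted <- merge(clusters, us.cities, by.x = "city", by.y = "name")

map("state")  # draw US state boundaries
points(plotted$long, plotted$lat, col = plotted$cluster, pch = 19, cex = 1.5)
```

The same pattern (join ids to coordinates, then plot) works with ggplot2 or sf if you prefer those.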
I have a csv of the following format:
person, location, time_of_day, money_spent
I've been going through and seeing how to format data to make it work with the more popular libraries (see: https://sites.google.com/site/daishizuka/toolkits/sna/sna_data), but they seem to be focused on various formats of expressing the connectedness between each member.
I would like to express extra dimensionality to my network by, say, coloring the nodes and connectors different colors according to time_of_day met, or change size of the various dots by money_spent. Can someone give me some guidance as to how I can do that with an implementation of network graphs in R?
I can figure out how to preprocess my data such that it is compatible; I'm just not getting how to implement things to the liking of the SNA libraries such as igraph.
The networkDynamic R package provides data structures for dynamic networks and some basic functionality for importing and manipulating this type of data. You should then be able to do analysis with the network, sna, or igraph packages (disclaimer: I'm one of the maintainers of networkDynamic).
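For the styling part of the question specifically, here is a sketch of attribute-driven plotting in igraph. It assumes you have already aggregated your csv into an edge list of person pairs with time_of_day and money_spent attached; the column names and values are hypothetical:

```r
library(igraph)

# Hypothetical edge list derived from the csv: who met whom, when,
# and how much was spent
edges <- data.frame(from        = c("alice", "bob",     "carol"),
                    to          = c("bob",   "carol",   "alice"),
                    time_of_day = c("morning", "evening", "morning"),
                    money_spent = c(12, 40, 25))

# Extra columns in the edge list automatically become edge attributes
g <- graph_from_data_frame(edges, directed = FALSE)

# Colour connectors by time_of_day...
E(g)$color <- ifelse(E(g)$time_of_day == "morning", "orange", "navy")

# ...and size each node by the total money_spent on its edges
V(g)$size <- 10 + sapply(V(g)$name, function(p)
  sum(edges$money_spent[edges$from == p | edges$to == p]) / 5)

plot(g, edge.width = 3)
```

Anything you set via `E(g)$...` or `V(g)$...` is picked up by `plot.igraph`, so other mappings (shape, label size, etc.) follow the same pattern.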
I am doing a project that involves processing large, sparse graphs. Does anyone know of any publicly available data sets that can be processed into large graphs for testing? I'm looking for something like a Facebook friend network, or something a little smaller with the same flavor.
I found the Stanford Large Network Dataset Collection pretty useful.
If you asked nicely, you might be able to get Brian O'Meara's data set for treetapper. It's a pretty nice example of real-world data in that genre. Particularly, you'd probably be interested in the coauthorship data.
http://www.treetapper.org/
http://www.brianomeara.info/
Github's API is nice for building out graphs. I've messed around with the Python library networkx to generate graphs of that network. Here's some sample code if you're interested.
Apologies for the double post, evidently I can only post two links at a time since I have <10 reputation...
DIMACS also has some data sets from their cluster challenge, and there's always the Graph500. The Boost Graph Library has a number of graph generators as well.
Depending on what you consider "large", there's the University of Florida Sparse Matrix Collection as well as some DIMACS Road Networks (mostly planar of course).
A few other ones:
Newman's page
Barabasi's page
Pajek software
Arena's page
Network Science
Closed 10 years ago. This question is off-topic and is not currently accepting answers.
What datasets exist out on the internet that I can run statistical analysis on?
The datasets package is included with base R. Run this command to see a full list:
library(help="datasets")
Beyond that, there are many packages that can pull data, and many others that contain important data. Of these, you may want to start by looking at the HistData package, which "provides a collection of small data sets that are interesting and important in the history of statistics and data visualization".
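For example, assuming HistData is installed, you can pull up John Snow's 1854 cholera data in a couple of lines:

```r
library(HistData)

# Locations of cholera deaths in Soho, London, from Snow's famous map
data(Snow.deaths)
head(Snow.deaths)
```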
For financial data, the quantmod package provides a common interface for pulling time series data from google, yahoo, FRED, and others:
library(quantmod)
getSymbols("YHOO",src="google") # from google finance
getSymbols("GOOG",src="yahoo") # from yahoo finance
getSymbols("DEXUSJP",src="FRED") # FX rates from FRED
FRED (the Federal Reserve Bank of St. Louis) is a real goldmine of free economic data.
Many R packages come bundled with data that is specific to their goal. So if you're interested in genetics, multilevel models, etc., the relevant packages will frequently have the canonical example for that analysis. Also, the book packages typically ship with the data needed to reproduce all the examples.
Here are some examples of relevant packages:
alr3: includes data to accompany Applied Linear Regression (http://www.stat.umn.edu/alr)
arm: includes some of the data from Gelman's "Data Analysis Using Regression and Multilevel/Hierarchical Models" (the rest of the data and code is on the book's website)
BaM: includes data from "Bayesian Methods: A Social and Behavioral Sciences Approach"
BayesDA: includes data from Gelman's "Bayesian Data Analysis"
cat: includes data for analysis of categorical-variable datasets
cimis: for retrieving data from CIMIS, the California Irrigation Management Information System
cshapes: includes GIS data boundaries and data
ecdat: data sets for econometrics
ElemStatLearn: includes data from "The Elements of Statistical Learning, Data Mining, Inference, and Prediction"
emdbook: data from "Ecological Models and Data"
Fahrmeir: data from the book "Multivariate Statistical Modelling Based on Generalized Linear Models"
fEcoFin: "Economic and Financial Data Sets" for Rmetrics
fds: functional data sets
fma: data sets from "Forecasting: methods and applications"
gamair: data for "Generalized Additive Models: An Introduction with R"
geomapdata: data for topographic and geologic mapping
nutshell: contains all the data from the "R in a Nutshell" book
nytR: provides access to congressional vote data through the NY Times API
openintro: data from the book
primer: includes data for "A Primer of Ecology with R"
qtlbook: includes data for the R/qtl book
RGraphics: includes data from the "R Graphics" book
Read.isi: access to old World Fertility Survey data
There's a broad selection on the Web. For instance, here's a massive directory of sports databases (all providing the data free of charge, at least in my experience). In that directory is databaseBaseball.com, which contains, among other things, complete datasets for every player who has ever played professional baseball since about 1915.
StatLib is another excellent resource, and beautifully convenient: this single web page lists 4-5 line summaries of over a hundred databases, all of which are available in flat-file form just by clicking the 'Table' link at the beginning of each data set summary.
The base distribution of R comes pre-packaged with a large and varied collection of datasets (122 in R 2.10). To get a list of them (as well as a one-line description):
data(package="datasets")
Likewise, most packages come with several data sets (sometimes a lot more). You can see those the same way:
data(package="latticeExtra")
data(package="vcd")
These data sets are the ones mentioned in the package manuals and vignettes for a given package, and used to illustrate the package features.
A few R packages with a lot of datasets (which again are easy to scan so you can choose what's interesting to you): AER, DAAG, and vcd.
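You can also scan every installed package at once for bundled data sets, using only base R:

```r
# List the data sets shipped by all installed packages
d <- data(package = .packages(all.available = TRUE))

# $results is a character matrix; Package, Item and Title are the
# columns worth browsing
head(d$results[, c("Package", "Item", "Title")])
```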
Another thing I find so impressive about R is its I/O. Suppose you want to get some very specific financial data via the Yahoo Finance API — say, the open and closing price of the S&P 500 for every month from 2001 to 2009. Just do this:
tick_data = read.csv(paste("http://ichart.finance.yahoo.com/table.csv?",
"s=%5EGSPC&a=03&b=1&c=2001&d=03&e=1&f=2009&g=m&ignore=.csv"))
In this one line of code, R has fetched the data, shaped it into a dataframe and bound it to 'tick_data', all in one step. (Here's a handy cheat sheet with the Yahoo Finance API symbols used to build URLs like the one above.)
http://www.data.gov.uk/data
Recently set up by Tim Berners-Lee. Obviously UK-based data, but that shouldn't matter. Covers everything from abandoned cars to school absenteeism to agricultural price indexes.
Have you considered Stack Overflow Data Dumps?
You are already familiar with what the data represents i.e. the business logic it tracks
Good starting points for economic data are the following three addresses:
World Bank - Research Datasets
IMF - Data and Statistics
National Bureau of Economic Research
A nice summary of dataset links for development economists can be found at:
Devecondata
Edit:
The World Bank decided last week to open up a lot of its previously non-free datasets and published them online on its revised homepage. The new internet appearance looks pretty nice as well.
The World Bank - Open Data
Another good site is UN Data.
The United Nations Statistics Division (UNSD) of the Department of Economic and Social Affairs (DESA) launched a new internet-based data service for the global user community. It brings UN statistical databases within easy reach of users through a single entry point (http://data.un.org/). Users can now search and download a variety of statistical resources of the UN system.
http://www.data.gov/ probably has something you can use.
In their catalog of raw data you can set your criteria for the data and find what you're looking for http://www.data.gov/catalog/raw
A bundle of 268 small text files (the worked examples of "The R Book") can be found in The R Book's companion website.
You could look at this post on FlowingData.
A collection of over 800 datasets in the ARFF format understood by Weka and other data analysis packages, gathered in the TunedIT.org Repository.
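If you work in R, ARFF files can be read directly with read.arff() from the foreign package (one of the recommended packages shipped with R); the file name below is hypothetical:

```r
library(foreign)

# Read a Weka ARFF file into a plain data frame
dat <- read.arff("somefile.arff")
str(dat)
```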
See the data competition set up by Hadley Wickham for the Data Expo of the ASA Statistical Computing and Statistical Graphics section. The competition is over, the data is still there.
UC Irvine Machine Learning Repository currently has 190 data sets.

The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
I've seen from your other questions that you are apparently interested in data visualization. Then have a look at the Many Eyes project (from IBM) and its sample data sets.
Similar to data.gov, but European-centred, is Eurostat:
http://epp.eurostat.ec.europa.eu/portal/page/portal/statistics/search_database
and there is a Chinese statistics department, too, as mentioned by Wildebeests:
http://www.stats.gov.cn/english/statisticaldata/monthlydata/index.htm
Then there are some "social data services" which offer the download of datasets, such as
Swivel, Many Eyes, Timetric, CKAN, Infochimps, ...
The FAO offers the AQUASTAT database, with various water-related indicators broken down by country.
The Naval Oceanography Portal offers, for instance, Fraction of the Moon Illuminated.
The blog "curving normality" has a list of interesting data sources.
Another collection of datasets.
Here's an R package with several agricultural datasets from books and papers, with example analyses included: agridat.