Background: I work with animals, and I have a way of scoring each pet based on a bunch of variables. This score indicates to me the health of the animal, say a dog.
Best practice:
Let's say I wanted to create a bunch of lookup tables to recreate this scoring model of mine, which I currently have stored only in my head, nowhere else. My goal is to import my scoring model into R. What is the best practice for doing this? I'm going to use it to make a function where I just type in the variables and get a result back.
I started writing it directly into Excel, with the idea of importing it into R, but I wonder if this is bad practice?
I have thought of JSON files, which I have no experience with, or just hardcoding a bunch of lists in R...
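Something like this minimal sketch is what I mean by hardcoding; the variable names and point values are made up purely for illustration, not my actual scoring model:
# Invented lookup tables -- the real model has more variables and different points.
body_condition_scores <- c(underweight = 0, ideal = 2, overweight = 1)
coat_scores           <- c(dull = 0, normal = 1, shiny = 2)

score_pet <- function(body_condition, coat) {
  # Look each variable up in its table and sum the points.
  unname(body_condition_scores[body_condition] + coat_scores[coat])
}

score_pet(body_condition = "ideal", coat = "shiny")   # returns 4
The same tables could just as well live in an Excel or CSV file and be read in with read.csv() or readxl::read_excel() without changing the scoring function.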
Writing the tables to an Excel file (or multiple Excel files) and reading them with R is not bad practice. I guess it comes down to the number of Excel files you have and your preferences.
R can connect to pretty much any of the most common file types (csv, txt, xlsx, etc.) and you will be fine reading them into R. Another option is the .RData format, which is native to R. You can use it with save and load:
df2 <- mtcars
save(df2, file = 'your_path_here')   # writes df2 to an .RData file at that path
load(file = 'your_path_here')        # restores df2 into the current workspace
Any of the above is fine as long as your data is not too big (e.g. if you end up with 100 Excel files which you need to update frequently, your data is probably becoming too big to maintain in Excel). If that ever happens, then you should consider creating a database (e.g. MySQL, SQLite, etc.) and storing your data there. R would then connect to the database to access the data.
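As a rough sketch of what that could look like with a local SQLite file via the DBI and RSQLite packages (table and column names here are just placeholders):
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "scoring_model.sqlite")
dbWriteTable(con, "coat_scores",
             data.frame(coat = c("dull", "normal", "shiny"), points = c(0, 1, 2)),
             overwrite = TRUE)
dbGetQuery(con, "SELECT points FROM coat_scores WHERE coat = 'shiny'")
dbDisconnect(con)
The same DBI code would work against a server-backed database such as MySQL by swapping the connection line.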
Related
I have a large population survey dataset for a project, and the first step is to make exclusions and arrive at a final dataset for analyses. To organize my work, I want to continue in a new file where I derive the survey variables correctly. Is there a command for continuing the work by carrying all the previous data and code over to the new file?
I don't think I understand the problem you have. You can always create multiple .R files and split the code among them as you wish, and you can also arrange those files as you see fit in the file system (group them in the same folder with informative names and comments, etc.).
As for the data side of the problem, you can load your data into R, make any changes / filters needed, and then save it to another file with one of the many functions for writing data to disk: write.table() from base, fwrite() from data.table (which can be MUCH faster), etc.
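For example, a minimal sketch of that load / filter / save step (the file names and the exclusion rule are placeholders, not your actual criteria):
library(data.table)
survey <- fread("survey_raw.csv")            # or read.csv() from base R
analysis_set <- survey[age >= 18]            # apply whatever exclusions you need
fwrite(analysis_set, "survey_analysis.csv")
# Base-R equivalent of the last step:
# write.table(analysis_set, "survey_analysis.csv", sep = ",", row.names = FALSE)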
I feel that my answer is way too obvious. When you say "project", do you mean "something I have to get done" or the actual projects that you can create in RStudio? If it's the first, then I think I have covered it. If it's the second, I never got to use that feature, so I am not going to be able to help :(
Maybe you can elaborate a bit more.
I work in an environment where we depend heavily on Excel for statistical work. We have our own Excel workbooks that create reports and charts and compute the models. But sometimes Excel is not enough, so we would like to use R to augment the data processing.
I am developing a fairly universal and low-level Excel workbook that is capable of converting our data structures stored in an Excel workbook into R using rcom and RExcel macros. Because the data are large, the process of porting them into R is lengthy (in terms of the time a user needs to wait after pressing F9 to recalculate the workbook), so I started to add caching capabilities to my Excel workbook.
Caching is achieved by embedding an extra attribute in the saved object(s): a function that checks whether the mtime of the Excel workbook holding the data structure has changed since the R object was created. Additionally, the template supports saving the objects to disk, so the next time it is not mandatory to use the workbook and the original Excel data structures at all when doing calculations that involve mostly R.
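The idea, reduced to a minimal sketch (the attribute name and helper functions here are simplified stand-ins, not the actual template code):
cache_object <- function(obj, workbook_path) {
  # Remember when the source workbook was last modified at caching time.
  attr(obj, "source_mtime") <- file.info(workbook_path)$mtime
  obj
}

cache_is_stale <- function(obj, workbook_path) {
  # TRUE if the workbook changed after the cached object was created.
  file.info(workbook_path)$mtime > attr(obj, "source_mtime")
}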
Although in most cases the user wouldn't care, internally it is sometimes more natural to save the data into one R object (like a data.frame), and sometimes saving a whole set of multiple R objects seems more intuitive.
When saving a single R object, saveRDS is more convenient, so I prefer it over save, which is geared towards multiple objects. (I know that I can always bundle multiple objects into one by combining them in a list.)
According to the manual for saveRDS, a file generated by save has its first 5 bytes equal to the ASCII representation of "RDX2\n". Is there any ready-made function to test for that, or should I manually open the file as binary, read the 5 bytes, handle the corner case where the file has fewer than 5 bytes, close the file, etc.?
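The manual check I have in mind would be something like this sketch (assuming the magic bytes are as the manual describes; gzfile() is used so compressed and uncompressed files are handled alike):
is_save_file <- function(path) {
  con <- gzfile(path, open = "rb")
  on.exit(close(con))
  magic <- readBin(con, what = "raw", n = 5L)
  # identical() is FALSE if fewer than 5 bytes were read,
  # which also covers the short-file corner case.
  identical(magic, charToRaw("RDX2\n"))
}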
I'm wondering if there is a possibility of downloading (from a website) a subset made from the original dataset in RData format. The easiest way, of course, is to proceed in this manner:
con <- url("http://xxx.com/datasets/dataset.RData")
load(con)    # loads whatever object the file contains, assumed here to be 'dataset'
close(con)
subset <- dataset[dataset$var == "yyy", ]
However, I'm trying to speed up my code and avoid downloading unnecessary columns.
Thanks for any feedback.
Matt
There is no mechanism for that task. There is also no mechanism for inspecting .RData files. In the past, when this has been requested, people have been advised to convert to a real database management system.
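For illustration, with the data in a DBMS only the rows and columns you ask for travel over the connection; the table, column and file names in this sketch are placeholders, and the same DBI code works against a server backend such as MySQL:
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "datasets.sqlite")
sub <- dbGetQuery(con, "SELECT var, value FROM dataset WHERE var = 'yyy'")
dbDisconnect(con)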
I know this is not a new concept by any stretch in R, and I have browsed the High Performance and Parallel Computing Task View. With that said, I am asking this question from a point of ignorance, as I have no formal training in computer science and am entirely self-taught.
Recently I collected data from the Twitter Streaming API and currently the raw JSON sits in a 10 GB text file. I know there have been great strides in adapting R to handle big data, so how would you go about this problem? Here are just a handful of the tasks that I am looking to do:
Read and process the data into a data frame
Basic descriptive analysis, including text mining (frequent terms, etc.)
Plotting
Is it possible to use R entirely for this, or will I have to write some Python to parse the data and throw it into a database in order to take random samples small enough to fit into R?
Simply put, any tips or pointers you can provide will be greatly appreciated. And I won't take offense if you describe solutions at a 3rd-grade level.
Thanks in advance.
If you need to operate on the entire 10GB file at once, then I second @Chase's point about getting a larger, possibly cloud-based computer.
(The Twitter streaming API returns a pretty rich object: a single 140-character tweet could weigh a couple of KB of data. You might reduce memory overhead if you preprocess the data outside of R to extract only the content you need, such as author name and tweet text.)
On the other hand, if your analysis is amenable to segmenting the data -- for example, you want to first group the tweets by author, date/time, etc -- you could consider using Hadoop to drive R.
Granted, Hadoop will incur some overhead (both cluster setup and learning about the underlying MapReduce model); but if you plan to do a lot of big-data work, you probably want Hadoop in your toolbox anyway.
A couple of pointers:
An example in chapter 7 of Parallel R shows how to set up R and Hadoop for large-scale tweet analysis. The example uses the RHIPE package, but the concepts apply to any Hadoop/MapReduce work.
You can also get a Hadoop cluster via AWS/EC2. Check out Elastic MapReduce for an on-demand cluster, or use Whirr if you need more control over your Hadoop deployment.
There's a brand-new package called colbycol that lets you read in only the variables you want from enormous text files:
http://colbycol.r-forge.r-project.org/
The read.table function remains the main data import function in R. This function is memory inefficient and, according to some estimates, it requires three times as much memory as the size of a dataset in order to read it into R.
The reason for such inefficiency is that R stores data.frames in memory as columns (a data.frame is no more than a list of equal-length vectors), whereas text files consist of rows of records. Therefore, R's read.table needs to read whole lines, process them individually by breaking them into tokens, and transpose these tokens into column-oriented data structures.
The ColByCol approach is memory efficient. Using Java code, it reads the input text file and outputs it into several text files, each holding an individual column of the original dataset. Then these files are read individually into R, thus avoiding R's memory bottleneck.
The approach works best for big files divided into many columns, especially when these columns can be transformed into memory-efficient types and data structures: R's representation of numbers (in some cases), and character vectors with repeated levels via factors, occupy much less space than their character representation.
The ColByCol package has been successfully used to read multi-GB datasets on a 2GB laptop.
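colbycol aside (the snippet below is not its API), the same "read only the columns you need" idea is available with tools I'm more certain of: read.table() skips a column when its colClasses entry is "NULL", and data.table::fread() has a select argument. Column and file names here are invented:
# Suppose the file has three columns (author, text, lang) and we only want two.
tweets <- read.table("tweets.tsv", sep = "\t", header = TRUE,
                     colClasses = c("character", "character", "NULL"))

library(data.table)
tweets <- fread("tweets.tsv", select = c("author", "text"))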
10GB of JSON is rather inefficient for storage and analytical purposes. You can use RJSONIO to read it in efficiently. Then, I'd create a memory-mapped file. You can use bigmemory (my favorite) to create different types of matrices (character, numeric, etc.), or store everything in one location, e.g. using HDF5 or SQL-esque options (e.g. see RSQLite).
What will be more interesting is the number of rows of data and the number of columns.
As for other infrastructure, e.g. EC2, that's useful, but preparing a 10GB memory-mapped file doesn't really require much infrastructure. I suspect you're working with just a few tens of millions of rows and a few columns (beyond the actual text of the tweets). This is easily handled on a laptop with efficient use of memory-mapped files. Doing complex statistics will require more hardware, cleverer use of familiar packages, and/or experimenting with some unfamiliar packages. I'd recommend following up with a more specific question when you reach that stage. The first stage of such work is simply data normalization, storage and retrieval. My answer for that is simple: memory-mapped files.
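As a rough sketch of the memory-mapped idea with bigmemory (dimensions, file names and the row I fill in are invented; the point is that the matrix is backed by a file on disk rather than held in RAM):
library(bigmemory)
x <- filebacked.big.matrix(nrow = 5e7, ncol = 4, type = "double",
                           backingfile = "tweets.bin",
                           descriptorfile = "tweets.desc")
x[1, ] <- c(1, 0, 42, 140)              # fill rows as you stream through the raw data
y <- attach.big.matrix("tweets.desc")   # later sessions re-attach without re-reading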
To read chunks of the JSON file in, you can use the scan() function. Take a look at the skip and nlines arguments. I'm not sure how much performance you'll get versus using a database.
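A minimal sketch of that chunked approach (chunk size and file name are placeholders; each chunk could then be handed to a JSON parser such as RJSONIO::fromJSON):
chunk_size <- 10000L
skip <- 0L
repeat {
  lines <- scan("tweets.json", what = character(), sep = "\n",
                skip = skip, nlines = chunk_size, quiet = TRUE)
  if (length(lines) == 0L) break
  # ... parse this chunk and keep only the fields you need ...
  skip <- skip + chunk_size
}
Note that skip-based chunking re-reads the skipped lines on every pass; opening a connection once and looping over readLines(con, n = chunk_size) avoids that.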
So I've been trying to read this particular .mat file into R. I don't know too much about Matlab, but I know enough to know that the R.matlab package can only read uncompressed data into R, and to save the data as uncompressed I need to do so in Matlab by using
save new.mat -v6
Okay, so I did that, but when I used readMat("new.mat") in R, it just got stuck loading forever. I also tried the hdf5 package via:
> hdf5load("new.mat", load=FALSE)->g
Error in hdf5load("new.mat", load = FALSE) :
can't handle hdf type 201331051
I'm not sure what this problem could be, but if anyone wants to try to figure this out the file is located at http://dibernardo.tigem.it/MANTRA/MANTRA_online/Matlab_Code%26Data.html and is called inventory.mat (the first file).
Thanks for your help!
This particular file has one object, inventory, which is a struct object with a lot of different things inside of it. Some are cell arrays, others are vectors of doubles or logicals, and a couple are matrices of doubles. It looks like R.matlab does not like cell arrays within structs, but I'm not sure exactly what is preventing R from loading this. For reasons like this, I'd generally recommend avoiding mapping Matlab structs to R objects. A struct is similar to a list, and this one can be transformed into a list, but it's not always a good idea.
I recommend pulling out each object, e.g. ids = inventory.instance_ids, and saving each one to a separate .mat file, or saving all of them, except for the inventory object, into one file. Even better is to go to text, e.g. via csvwrite, so that you can see what's being created.
I realize that's working around the use of a Matlab-to-R reader, but having things in a common, universal format is much more useful for reproducibility than acquiring a bunch of different readers for a proprietary format.
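Once the pieces have been written out as text on the Matlab side, reading them back in R is plain read.csv(); the file names here are hypothetical, and csvwrite produces no header row:
ids    <- read.csv("instance_ids.csv", header = FALSE)
scores <- read.csv("scores.csv", header = FALSE)
str(ids)   # check that what arrived matches what Matlab held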
Alternatively, you can pass objects in memory via R.matlab, or this set of functions + the R/DCOM interface (on Windows).
Although this doesn't address how to use R.matlab, I've done a lot of transferring of data between R and Matlab, in both directions, and I find that it's best to avoid .mat files (and, similarly, .rdat files). I like to pass objects in memory, so that I can inspect them on each side, or via standard text files. Dealing with application specific file formats, especially those that change quite a bit and are inefficient (I'm looking at you MathWorks), is not a good use of time. I appreciate the folks who work on readers, but having a lot more control over the data structures used in the target language is very much worth the space overhead of using a simple output file format. In-memory data transfer is very nice because you can interface programs, but that may be a distraction if your only goal is to move data.
Have you run the examples in http://cran.r-project.org/web/packages/R.matlab/R.matlab.pdf on pages 22 to 24? That will test your ability to read from versions 4 and 5. I'm not sure that R cannot read compressed files. There is an Rcompression package on Omegahat.