R - downloading a subset from an original dataset

I'm wondering whether it is possible to download (from a website) a subset of an original dataset stored in .RData format. The easiest way, of course, is to proceed in this manner:
con <- url("http://xxx.com/datasets/dataset.RData")
load(con)                                   # loads whatever object the file contains, e.g. `dataset`
subset <- dataset[dataset$var == "yyy", ]   # assuming the loaded object is named `dataset`
However, I'm trying to speed up my code and avoid downloading unnecessary columns.
Thanks for any feedback.
Matt

There is no mechanism for that task. There is also no mechanism for inspecting .RData files short of loading them. In the past when this has been requested, people have been advised to convert to a real database management system.
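A minimal sketch of the database route, assuming the data have already been loaded into a SQLite file containing a table named dataset (the file, table, and column names here are placeholders):
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "dataset.sqlite")

# pull only the rows and columns that are actually needed
sub <- dbGetQuery(con, "SELECT col1, col2 FROM dataset WHERE var = 'yyy'")
dbDisconnect(con)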

Related

R Social Network Analysis/Data Manipulation Question: Reading in .edges, .circles, .egofeat, .feat, and .featnames files

So I'm working with a network dataset from Stanford's SNAP Datasets. SNAP has wrappers for Python and C++ but not R; however, the data is still usable since I believe it's essentially a collection of CSV-like text files.
I can read in the .edges file and form an igraph object, but I want to read in the other files, get the attributes, and add those attributes to the igraph object for analysis. I'm just confused about how to work with the .circles, .egofeat, .feat, and .featnames files, since the documentation on the dataset is very scarce. I'm hoping someone has worked with the dataset in R, or even another language, and has any pointers to get started.
Thank you!
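A minimal sketch of attaching the node features, assuming the usual SNAP ego-network layout (an .edges file with one "from to" pair per line, a .feat file whose first column is the node id followed by 0/1 feature flags, and a .featnames file with one "index name" entry per feature); the ego id 107 and the parsing details are assumptions to adapt:
library(igraph)

edges <- read.table("107.edges")                 # hypothetical ego file: two columns, from and to
g <- graph_from_data_frame(edges, directed = FALSE)

feat <- read.table("107.feat")                   # node id followed by binary feature columns
fn   <- sub("^\\S+\\s+", "", readLines("107.featnames"))   # drop the leading index, keep the name

# attach each binary feature as a vertex attribute, matched on node id
idx <- match(V(g)$name, as.character(feat[[1]]))
for (j in seq_len(ncol(feat) - 1)) {
  g <- set_vertex_attr(g, name = fn[j], value = feat[idx, j + 1])
}
If the files follow that layout, .circles (a circle name followed by its member node ids per line) and .egofeat (the ego's own feature row) can be read similarly with readLines() and strsplit() on whitespace.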

How to continue project in new file in R

I have a large population survey dataset for a project and the first step is to make exclusions and have a final dataset for analyses. To organize my work, I must continue my work in a new file where I derive survey variables correctly. Is there a command used to continue work by saving all the previous data and code to the new file?
I don't think I understand the problem you have. You can always create multiple .R files and split the code among them as you wish, and you can also arrange those files as you see fit in the file system (group them in the same folder with informative names and comments, etc.).
As for the data side of the problem, you can load your data into R, make any changes / filters needed, and then save the result to another file with one of the many functions for writing data to disk: write.table() from base R, fwrite() from data.table (which can be MUCH faster), etc.
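A minimal sketch of that workflow (the file names and exclusion criteria are placeholders, not taken from the question):
library(data.table)

raw <- fread("survey_raw.csv")                    # original survey data
analysis <- raw[age >= 18 & !is.na(income)]       # apply the exclusions
fwrite(analysis, "survey_analysis.csv")           # save the final dataset

# a second script can then start directly from the saved file
analysis <- fread("survey_analysis.csv")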
I feel that my answer may be way too obvious. When you say "project", do you mean "something I have to get done" or the actual projects you can create in RStudio? If it's the first, then I think I have covered it. If it's the second, I never got to use that feature, so I am not going to be able to help :(
Maybe you can elaborate a bit more.

Best practise: lookup tables in R

Background: I work with animals, and I have a way of scoring each pet based on a bunch of variables. This score indicates to me the health of the animal, say a dog.
Best practise:
Let's say I wanted to create a bunch of lookup tables to recreate this scoring model of mine, which I have only stored in my head. Nowhere else. My goal is to import my scoring model into R. What is the best practise for doing this? I'm going to use it to make a function where I just type in the variables and get a result back.
I started writing it directly in Excel, with the idea of importing it into R, but I wonder if this is bad practise?
I have thought of JSON files, which I have no experience with, or just hardcoding a bunch of lists in R...
Writing the tables to an Excel file (or multiple Excel files) and reading them with R is not bad practice. I guess it comes down to the number of Excel files you have and your preferences.
R can connect to pretty much any of the most common file types (csv, txt, xlsx, etc.) and you will be fine reading them in R. Another option is .RData files, which are native to R. You can use them with save and load:
df2 <- mtcars
save(df2, file = 'your_path_here.RData')
load(file = 'your_path_here.RData')
Any of the above is fine as long as your data is not too big (e.g. if you start having 100 Excel files which you need to update frequently, your data is probably becoming too big to maintain in Excel). If that ever happens, then you should consider creating a database (e.g. MySQL, SQLite, etc.) and storing your data there. R would then connect to the database to access the data.
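A minimal sketch of the lookup-table idea, ending in the kind of function described in the question (the file, sheet, and variable names are invented placeholders, not the actual scoring model):
library(readxl)

# one sheet per scoring variable, each with columns `level` and `points`
weight_tbl <- read_excel("scoring_model.xlsx", sheet = "weight")
age_tbl    <- read_excel("scoring_model.xlsx", sheet = "age")

score_dog <- function(weight_class, age_class) {
  w <- weight_tbl$points[match(weight_class, weight_tbl$level)]
  a <- age_tbl$points[match(age_class, age_tbl$level)]
  w + a
}

score_dog("overweight", "senior")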

Downloading Biggish Datasets with bigrquery - Best practices?

I'm trying to download a table of about 250k rows and 500 cols from BigQuery into R for some model building in h2o using the R wrappers. It's about 1.1 GB when downloaded from BQ.
However, it runs for a long time and then loses the connection, so the data never makes it to R (I'm rerunning now so I can get a more precise example of the error).
I'm just wondering whether using bigrquery for this is a reasonable task, or whether bigrquery is mainly for pulling smaller datasets from BigQuery into R.
Just wondering if anyone has any tips and tricks that might be useful - I'm going through the library code to try to figure out exactly how it does the download (I was going to see if there was an option to shard out the file or something), but I'm not entirely sure I even know what I'm looking at.
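For reference, a minimal sketch of the bigrquery route being attempted (the project, dataset, and table names are placeholders, and the exact functions have changed across bigrquery versions):
library(bigrquery)

tb <- bq_project_query(
  "my-project",
  "SELECT * FROM `my-project.my_dataset.model_data_final`"
)

# a smaller page_size can sometimes help when a wide table keeps dropping the connection
df <- bq_table_download(tb, page_size = 1000)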
Update:
I've gone with the quick fix of using the CLIs to export the table to Google Cloud Storage and download it locally:
bq extract blahblah gs://blah/blahblah_*.csv
gsutil cp gs://blah/blahblah_*.csv /blah/data/
And then to read the data just use:
# get file names in case the table was sharded across multiple files
file_names <- paste0('/blah/data/', list.files(path = '/blah/data/', pattern = paste0(my_lob, '_model_data_final')))
# read each file and stack them into one data frame
df <- do.call(rbind, lapply(file_names, read.csv))
It's actually a lot quicker this way - 250k rows, no problem.
I do find that BigQuery could do with a bit better integration into the wider ecosystem of tools out there. I love the R + Dataflow examples; definitely going to look into those a bit more.

R package, size of dataset vis-a-vis code

I am designing an R package (http://github.com/bquast/decompr) to run the Wang-Wei-Zhu export decomposition (http://www.nber.org/papers/w19677).
The complete package is only about 79 kilobytes.
I want to supply an example dataset, especially because the input objects are somewhat complex. A relevant real-world dataset is available from http://www.wiod.org; however, the total size of the .RData object would come to about 1 megabyte.
My question therefore is: would it be a good idea to include the relevant dataset, which is so much larger than the package itself?
It is not unusual for code to be significantly smaller than the data it works on. However, I will not be the only one to suggest the following (especially if you want to submit to CRAN):
Consult the Writing R Extensions manual. In particular, make sure that the data file is stored in a compressed format, and set LazyData in the DESCRIPTION when applicable.
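A minimal sketch of preparing such a compressed data file (the object name and contents are placeholders; in the real package this would be a small subset of the WIOD-style input):
# placeholder example object
wwz_example <- data.frame(sector = letters[1:5], value = runif(5))

# save into the package's data/ directory with strong compression
dir.create("data", showWarnings = FALSE)
save(wwz_example, file = "data/wwz_example.rda", compress = "xz")

# report size/compression of the .rda files and repack them with the chosen method
tools::resaveRdaFiles("data", compress = "xz")
tools::checkRdaFiles("data")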
The CRAN Repository Policies also have a thing or two to say about data files. There is a hard maximum of 5MB for documentation and data. If the code is likely to change and the data are not, consider creating a separate data package.
PDF documentation can also be distributed, so it is possible to write a "vignette" that is not built by running code when the package is bundled, but that instead illustrates usage with static code snippets showing how to download the data. Downloading inside the vignette itself is ruled out, since the manual states that all files needed to build it must be available on the local file system.
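A sketch of what such a static snippet could look like (the URL path is a placeholder; in the vignette source the chunk would be marked eval = FALSE so nothing is fetched at build time):
# not run when the vignette is built; shown to the reader as instructions
download.file("http://www.wiod.org/path/to/wiot_data.RData",
              destfile = "wiot_data.RData", mode = "wb")
load("wiot_data.RData")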
I would also ask whether including a subset of the data would not be sufficient to illustrate the use of the package.
Finally, if you don't intend to submit to a package repository, I can't imagine a megabyte download being a breach of etiquette.
