I am trying to obtain a database that comes from Mongo DB to R, so I can make anlaysis on it. The bridge between these two is a R package: Rmongo.
As I have some policy rules, I cannot show you the dataset and my output, so I will try to explain as best as possible.
My two first commands, after installing the package, are these ones:
mg1 <- mongoDbConnect("test", "localhost", 27018)
dbShowCollections(mg1)
Which works, as it shows the collection, or the different variables.
Then, I can use the commands made by the Rmongo package, meaning:
query = dbGetQuery(mg1, 'address_history','{}')
This normally returns a data frame with all the variables on each column. But, because it is a nested file, I only get the first three variables (out of around fifty) because they are at the top of the nest. For the rest, I get one column of the data frame with the json code (so of approximately 50 variables) that I cannot seem to turn in a data frame. If someone is familiar with that, please help me.
I already saw on Stack Overflow a way to do it manually thanks to gsub, and in general pattern with the code, but this code is dissimilar, and doing it manually will not make it work.
Furthermore, there is also another command via the Rmongo package:
query2 = dbGetQueryForKeys(mg1, 'address_history', '{}', '{address:1}')
where I can return the variable that I want. Unfortunately, because this is a nested file, it also cannot find the variables that are not in the top of the nest.
Is there another command or another package that I can use? I am open to any other opportunity to get this dataset (very large) into an R data frame, so I can make any inferences.
Thank you very much!
I tried just now setting up Rmongo and mongolite for R. I got mongolite working in minutes with the starter data locally . I could not get even get the data I wanted inserted using Rmongo.
I think if you try installing mongolite you will find their documentation and package simpler. https://github.com/jeroen/mongolite
Related
I am working on applying the ptw package to my GC-MS wine data. So far I have been able to correctly use this package on the apples example data described in the vignette (MTBLS99). Since I am new to R, I am unable to get my .CDF files into the format they used to start the vignette. They started with three data frames (All.pks, All.tics, All.xset). I assume that this was generated using the xcms package. But I cannot recreate the specific steps used for the data to be formatted in this manner. Has anyone successfully applied 'ptw' to their LC/GC-MS data? can someone share the code used for generating the All.pks, All.tics, All.xset data frames?
Q. How does one write an R package function that references an R package dataset in a way that is simple/friendly/efficient for an R user. For example, how does one handle afunction() that calls adataset from a package?
What I think may not be simple/friendly/efficient:
User is required to run
data(adataset) before running
afunction(...)
or else receiving an Error: ... object 'adataset' not found. I have noticed some packages have built-in datasets that can be called anytime the package is loaded, for example, iris, which one can call without bringing it to the Global Environment.
Possible options which I have entertained:
Write data(NamedDataSet) directly into the function. Is this a bad idea. I thought perhaps it could be, looking at memory and given my limiting understanding of function environments.
Code the structure of the dataset directly into the function. I think this works depending on the size of the data but it makes me wonder about how to go about proper documentation in the package.
Change Nothing. Given a large enough dataset, maybe it does not make sense to implement a way different from reading it before calling the function.
Any comments are appreciated.
You might find these resources about writing R data packages useful:
the "External Data" section of R Packages
Creating an R data package
Creating an R data package (overview)
In particular, take note of the DESCRIPTION file and usage of the line LazyData: true. This is how datasets are made available without having to use data(), as in the iris example that you mention.
I'm trying to use the DESeq2 package in R for differential gene expression, but I'm having trouble creating the required RangedSummarizedExperiment object from my input data. I have found several tutorials and vignettes for doing this, but they all seem to apply to a raw data set that is different from mine. My data has gene names as row names and patient id as column names, and the data is simply integer count data. There has to be a simple way to create the RangedSummarizedExperiment object from this type of input data, but I haven't yet found a way. Can anybody help? Thanks.
I had a similar problem understanding how to use this data structure. I eventually managed to do without it by using DESeqDataSetFromMatrix. You can see an example in the first code block of Modify r object with rpy2 (this code is pure R, rpy2 stuff comes after). In this example, I have genes as rows and samples as columns, so it is likely you will be able to adopt the same approach.
I am using the library(eventstudies)(Event Studies Package). In the sample they use:
(data(StockPriceReturns))
(data(SplitDates))
(head(SplitDates))
However I do not know how to set up my own dataset to use the package. My quesiton is:
How to look into the StockPriceReturns data?
I appreciate your answer!
I think you want to read a data set into a data frame or table.
I'm not familiar with that package, so I'm not sure about required format. If the data set you read in matches the schema of StockPriceReturns, I'm sure R will process it just fine. This PDF appears to explain it well.
How can I use the R packages zoo or xts with very large data sets? (100GB)
I know there are some packages such as bigrf, ff, bigmemory that can deal with this problem but you have to use their limited set of commands, they don't have the functions of zoo or xts and I don't know how to make zoo or xts to use them.
How can I use it?
I've seen that there are also some other things, related with databases, such as sqldf and hadoopstreaming, RHadoop, or some other used by Revolution R. What do you advise?, any other?
I just want to aggreagate series, cleanse, and perform some cointegrations and plots.
I wouldn't like to need to code and implement new functions for every command I need, using small pieces of data every time.
Added: I'm on Windows
I have had a similar problem (albeit I was only playing with 9-10 GBs). My experience is that there is no way R can handle so much data on its own, especially since your dataset appears to contain time series data.
If your dataset contains a lot of zeros, you may be able to handle it using sparse matrices - see Matrix package ( http://cran.r-project.org/web/packages/Matrix/index.html ); this manual may also come handy ( http://www.johnmyleswhite.com/notebook/2011/10/31/using-sparse-matrices-in-r/ )
I used PostgreSQL - the relevant R package is RPostgreSQL ( http://cran.r-project.org/web/packages/RPostgreSQL/index.html ). It allows you to query your PostgreSQL database; it uses SQL syntax. Data is downloaded into R as a dataframe. It may be slow (depending on the complexity of your query), but it is robust and can be handy for data aggregation.
Drawback: you would need to upload data into the database first. Your raw data needs to be clean and saved in some readable format (txt/csv). This is likely to be the biggest issue if your data is not already in a sensible format. Yet uploading "well-behaved" data into the DB is easy ( see http://www.postgresql.org/docs/8.2/static/sql-copy.html and How to import CSV file data into a PostgreSQL table? )
I would recommend using PostgreSQL or any other relational database for your task. I did not try Hadoop, but using CouchDB nearly drove me round the bend. Stick with good old SQL