Adding and updating data in R packages

I'm writing my own R package to carry out some specific analyses for which I make a bunch of API calls to get some data from some websites. I have multiple keys for each API and I want to cycle them for two reasons:
Ensure I don't go over my daily limit
Depending on who is using the package, different keys may be used
All my keys are stored in a .csv file, api_details.csv. This file is read by a function that gets the latest usage statistics and returns the key with the most calls available. I can add the .csv file to the package's data folder so that it is available when the package is loaded, but this presents two problems:
The .csv file is not read properly: all column names are pasted together into a single variable name, and all values in each row are pasted together into a single observation.
As I continue working, I would like to add more keys (and perhaps more details about each key) to api_details.csv, but I'm not sure how to do that.
I could save the details as an .RData file, but I'm not sure how it would be updated or read outside of R by other people. Using a .csv means that anyone using the package can easily add or remove keys.
What's the best method to address points 1 and 2 above?
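For reference, a common convention (not stated in the thread) is to ship editable raw files under inst/extdata rather than data/, and read them with system.file(); a minimal sketch, where the package name "mypkg" is a placeholder:

```r
# Hypothetical layout: api_details.csv lives in inst/extdata/ of "mypkg",
# so it survives installation and stays a plain-text file users can edit.
read_api_details <- function(path = NULL) {
  if (is.null(path)) {
    # system.file() resolves to the installed copy of the file
    path <- system.file("extdata", "api_details.csv", package = "mypkg")
  }
  # stringsAsFactors = FALSE keeps the key strings as character vectors
  read.csv(path, stringsAsFactors = FALSE)
}
```

Because the file is read at call time rather than baked into the package image, adding or removing keys is just editing the CSV.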

Related

Column pruning on parquet files defined as an external table

Context: We store historical data in Azure Data Lake as versioned parquet files from our existing Databricks pipeline, where we write to different Delta tables. One particular log source is about 18 GB a day in parquet. I have read through the documentation and executed some queries using Kusto.Explorer against the external table I have defined for that log source. In the query summary window of Kusto.Explorer I see that the entire folder is downloaded when I query it, even when using the project operator. The only exception seems to be when I use the take operator.
Question: Is it possible to prune columns to reduce the amount of data being fetched from external storage? Whether during external table creation or using an operator at query time.
Background: The reason I ask is that in Databricks it is possible to use a SELECT statement to fetch only the columns I'm interested in, which reduces query time significantly.
As David wrote above, the optimization does happen on the Kusto side, but there's a bug with the "Downloaded Size" metric: it shows the total data size regardless of the selected columns. We'll fix it. Thanks for reporting.

Is there an R package to generate a .ris file based on a query in a database?

For a scoping study/systematic literature review, I would like a package that generates a reference list as a .ris file directly from publisher databases such as Wiley, PubMed, Science Direct, Web of Science and JSTOR.
Is there a package (or a workaround with an API) that can output all the resources returned by a database query as a file/data frame in R?
I have read about "refwork" and "revtools" so far, but they seem to require .ris data up front. I am looking for something that generates this file for me, rather than doing it manually (ticking results page by page and exporting them).

Excel report using Access db (~2 GB) as backend: substitutable by R with shiny or markdown?

I currently have an Access database which pulls data from an Oracle database for various countries and is currently around 1.3 GB. However, more countries and dimensions should be added, which will further increase its size. My estimate is around 2 GB, hence the title.
Per country, there is one table. These tables are then linked in a second Access db, where the user can select a country through a form which pulls the data from the respective linked table in Access db1, aggregates it by month, and writes it into a table. This table is then queried from Excel and some graphs are displayed.
Also, there is another form where the user can select certain keys, such as business area, and split the data by this. This can be pulled into a second sheet in the same Excel file.
The users would not only want to be able to filter and group by more keys, but also be able to customize the period of time for which data is displayed more freely, such as from day xx to day yy aggregated by week or month (currently, only month-wise starting from the first of each month is supported).
Since the current Access-Access-Excel solution seems to me to be quite a cumbersome one, I was wondering whether I might make the extensions required by the users of this report using R and either shiny or markdown. I know that shiny does not allow files larger than 30MB to be uploaded but I would plan on using it offline. I was just not able to find a file size limit for this - or do the same restrictions apply?
I know some R and I think that the data aggregations needed could be done very fast using dplyr. The problem is that the users do not, so the report needs to be highly customizable while requiring no technical knowledge. Since I have no preexisting knowledge of shiny or markdown, I was wondering whether it was worth going through the trouble of learning one enough to implement this report in them.
Would what I want to do be feasible in shiny or R markdown? If so, would it still load quickly enough to be usable?
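For what it's worth, the month-wise aggregation the report performs is a few lines in R, whether in dplyr or base R; a base-R sketch on a toy stand-in for one country's table (names are illustrative; in practice the data would come from the Oracle source via DBI/odbc):

```r
# Toy stand-in for one country's transaction table.
tx <- data.frame(
  date   = as.Date(c("2020-01-05", "2020-01-20", "2020-02-03")),
  amount = c(10, 20, 30)
)

# Aggregate by calendar month; swapping "%Y-%m" for cut(tx$date, "week")
# would give the weekly grouping the users asked for.
tx$month <- format(tx$date, "%Y-%m")
monthly  <- aggregate(amount ~ month, data = tx, FUN = sum)
```

The same pattern extends to grouping by business area or any other key: add the column to the formula's right-hand side.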

Getting R to interact with an external application

I have an R script that used to be used on standalone CSV files but now needs to accept similar inputs from another, existing application. What are the typical options to call R from an external application written in Python and to pass data to it?
As a toy example, you could imagine a web application written in Python that needs to send R a dataset; the R script then calculates summary stats and sends the result back to the application. The input dataset is small — think of it as one row from a database with approx. 20 fields. The fields are a mix of text and numbers, and their number is fixed for a given call. In the earlier flow these fields were the members of a CSV file line.
Example:
New York, 23456,,25.5, 23/04/2015,, 0, 0, Yes, Yes, Absent
The return from R is something like:
0.87, Demographics, NA, History, NA
PS. I don't mean something like Shiny, which provides both the front end and the back end. Here the external application already exists and just needs a way to call R with its data and get a result back.
I would suggest the rpy2 package, which lets you use R commands and functions directly from a Python script rather than sending and receiving data back and forth to a separate R process.
rpy2 main website
Here is a nice tutorial on rpy2.
Have a deeper look at: Rserve
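Beyond rpy2 and Rserve, a third low-tech option is to expose the R script as a command-line tool that the Python application shells out to via Rscript, passing the record as an argument and reading the result from stdout. A base-R sketch; the field handling and "score" here are purely illustrative, not the thread's actual logic:

```r
# Hypothetical entry point: the Python app would run
#   Rscript score.R "New York, 23456,,25.5, ..."
# and capture whatever this prints to stdout.
score_record <- function(line) {
  fields <- trimws(strsplit(line, ",")[[1]])
  n_missing <- sum(fields == "")
  # Toy "summary stat": share of non-empty fields, rounded to 2 digits.
  score <- round(1 - n_missing / length(fields), 2)
  paste(c(score, "Demographics", "History"), collapse = ", ")
}

args <- commandArgs(trailingOnly = TRUE)
if (!interactive() && length(args) > 0) {
  cat(score_record(args[1]), "\n")
}
```

This keeps the two processes fully decoupled: Python only needs subprocess.run(), at the cost of one R startup per call, which matters only if call volume is high.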

Best format to store incremental data in regularly using R

I have a database that stores transactional records: records are created, and another process picks them up and then removes them. Occasionally this process breaks down and the number of records builds up. I want to set up a (semi-)automated way to monitor things, and as my tool set is limited and I have an R-shaped hammer, this looks like an R-shaped nail.
My plan is to write a short R script that will query the database via ODBC, and then write a single record with the datetime, the number of records in the query, and the datetime of the oldest record. I'll then have a separate script that will process the data file and produce some reports.
What's the best way to create my data file? At the moment my options are:
Load a dataframe, add the record and then resave it
Append a row to a text file (e.g. a CSV file)
Any alternatives, or a recommendation?
I would be tempted by the second option: semantically, you don't need the old entries to write a new one, so there is no reason to reload all the data each time. Doing so would only cost extra time and resources.
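The append-only approach needs no packages; a minimal sketch in base R, with an illustrative file name and column set:

```r
# Append one monitoring snapshot to a CSV, writing the header only
# the first time (when the file does not exist yet).
log_snapshot <- function(n_records, oldest, file = "queue_log.csv") {
  row <- data.frame(
    checked_at = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    n_records  = n_records,
    oldest     = oldest
  )
  write.table(row, file,
              sep = ",", row.names = FALSE,
              col.names = !file.exists(file),  # header once
              append = file.exists(file))
}
```

The reporting script can then read the whole history back with a single read.csv() call.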