Export dataset from R to REDCap?

I want to export a dataset containing personal IDs and other variables to REDCap. Does anyone know how to do it?
I have found a package called REDCapR, but it only seems to cover importing data from REDCap into R.

REDCapR::redcap_write() moves data from a data.frame in your local R session to the remote REDCap server.
# Read the dataset for the first time.
result_read1 <- REDCapR::redcap_read_oneshot(redcap_uri=uri, token=token)
ds1 <- result_read1$data
ds1$telephone
# Manipulate a field in the dataset in a VALID way
ds1$telephone <- paste0("(405) 321-000", seq_len(nrow(ds1)))
ds1 <- ds1[1:3, ]
ds1$age <- NULL; ds1$bmi <- NULL # Drop the calculated fields before writing.
# Upload the data to the server.
result_write <- REDCapR::redcap_write(ds1, redcap_uri=uri, token=token)
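If you want to confirm that the upload succeeded before relying on it, you can inspect the returned list (a minimal sketch; the element names success, records_affected_count, and outcome_message are taken from the reference manual linked below, so check them against your installed version):
# Check the outcome of the write (element names assumed from the reference manual).
if (result_write$success) {
  message("Wrote ", result_write$records_affected_count, " records to REDCap.")
} else {
  warning("Write failed: ", result_write$outcome_message)
}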
Documentation and references:
The function's page in the reference manual: https://ouhscbbmc.github.io/REDCapR/reference/redcap_write.html
Troubleshooting document: https://ouhscbbmc.github.io/REDCapR/articles/TroubleshootingApiCalls.html#writing
Other vignettes that describe global approaches to working with R & REDCap: https://ouhscbbmc.github.io/REDCapR/articles/index.html
(I'm the primary REDCapR developer. The REDCapR::redcap_write() function was one of the first functions added, back in 2013.)

How to save an empty data set to ADLS Gen2 using R?

I've searched but didn't find any suggestions.
I want to know how to save an empty file to ADLS Gen2 using R and then read it back within the same code.
Thanks for the help.
Since you want to write an empty file to ADLS Gen2 using R and read it back from that location, first create a data frame.
The data frame can either be completely empty, or it can have column names but no rows. You can use the following code to create one.
df <- data.frame() # completely empty, OR
df <- data.frame(<col_name> = <type>) # e.g. data.frame(ABC = character(0)) for named columns with no rows
Once you have created the data frame, you must establish a connection to the ADLS Gen2 account. You can use the method described in the following link to do that.
https://saketbi.wordpress.com/2019/05/11/how-to-connect-to-adls-gen2-using-sparkr-from-databricks-rstudio-while-integrating-securely-with-azure-key-vault/
After making the connection, you can use R's read and write functions with the path to the ADLS Gen2 storage. The following link lists a number of functions that can be used according to your requirements.
https://www.geeksforgeeks.org/reading-files-in-r-programming/
Sharing the solution to my question:
I used SparkR to create an empty data frame and saved it to ADLS after setting up the Spark context.
The solution is below:
Step 1: Create a schema
fcstSchema <- structType(structField("ABC", "string", TRUE))
new_df <- data.frame(ABC = NULL, stringsAsFactors = FALSE)
n <- createDataFrame(new_df, fcstSchema)
SparkR::saveDF(n, path = "abfss://<account_name>@<datalake>.dfs.core.windows.net/<path>/", source = "delta", mode = "overwrite")
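To read the empty data set back in the same session, a sketch along these lines should work (the placeholders mirror the path above, and it assumes the delta source is available on your cluster):
# Read the empty Delta table back from ADLS Gen2 (same placeholder path as above).
readback <- SparkR::read.df(
  path = "abfss://<account_name>@<datalake>.dfs.core.windows.net/<path>/",
  source = "delta")
SparkR::printSchema(readback) # should show the ABC string column
SparkR::count(readback)       # 0 rows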

Cannot access EIA API in R

I'm having trouble accessing the Energy Information Administration's API through R (https://www.eia.gov/opendata/).
On my office computer, if I try the link in a browser it works, and the data shows up (the full url: https://api.eia.gov/series/?series_id=PET.MCREXUS1.M&api_key=e122a1411ca0ac941eb192ede51feebe&out=json).
I am also successfully connected to Bloomberg's API through R, so R is able to access the network.
Since the API is working and not blocked by my company's firewall, and R is in fact able to connect to the Internet, I have no clue what's going wrong.
The script works fine on my home computer, but on my office computer it fails. So I gather it is a network issue, but if somebody could point me in any direction as to what the problem might be, I would be grateful (my IT department couldn't help).
library(XML)
api.key = "e122a1411ca0ac941eb192ede51feebe"
series.id = "PET.MCREXUS1.M"
my.url = paste("http://api.eia.gov/series?series_id=", series.id,"&api_key=", api.key, "&out=xml", sep="")
doc = xmlParse(file=my.url, isURL=TRUE) # yields error
Error msg:
Error: 1: No such file or directory
2: failed to load external entity "http://api.eia.gov/series?series_id=PET.MCREXUS1.M&api_key=e122a1411ca0ac941eb192ede51feebe&out=json"
I tried some other methods like read_xml() from the xml2 package, but this gives a "could not resolve host" error.
To get XML, you need to request XML in your url (&out=xml):
my.url = paste("http://api.eia.gov/series?series_id=", series.id,"&api_key=",
api.key, "&out=xml", sep="")
res <- httr::GET(my.url)
xml2::read_xml(res)
Or:
res <- httr::GET(my.url)
XML::xmlParse(httr::content(res, "text"))
Otherwise, with the url as in the post (i.e. &out=json):
res <- httr::GET(my.url)
jsonlite::fromJSON(httr::content(res,"text"))
or, if the url requests XML instead:
xml2::read_xml(httr::content(res,"text"))
Please note that this answer simply provides a way to get the data; whether it is in the desired form is up to whoever is processing the data.
If it does not have to be XML output, you can also use the new eia package. (Disclaimer: I'm the author.)
Using your example:
remotes::install_github("leonawicz/eia")
library(eia)
x <- eia_series("PET.MCREXUS1.M")
This assumes your key is set globally (e.g., in .Renviron or previously in your R session with eia_set_key). But you can also pass it directly to the function call above by adding key = "yourkeyhere".
The result returned is a tidyverse-style data frame, one row per series ID and including a data list column that contains the data frame for each time series (can be unnested with tidyr::unnest if desired).
Alternatively, if you set the argument tidy = FALSE, it will return the list result of jsonlite::fromJSON without the "tidy" processing.
Finally, if you set tidy = NA, no processing is done at all and you get the original JSON string output for those who intend to pass the raw output to other canned code or software. The package does not provide XML output, however.
There are more comprehensive examples and vignettes at the eia package website I created.
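Putting the pieces together, a minimal sketch of the workflow described above (the list column is assumed to be named data, and eia_set_key/eia_series are used as in this answer; adjust if your version of the package differs):
library(eia)
eia_set_key("yourkeyhere")        # or set the key in .Renviron
x <- eia_series("PET.MCREXUS1.M")
# Expand the list column holding the observations (column name assumed to be "data").
xt <- tidyr::unnest(x, cols = data)
head(xt)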

Using X4R package in R to connect to an SSAS (local) data cube?

I am trying to figure out how to use X4R package to load a local data cube into R. I can load the cube into Excel, but am failing to get this to work with R.
My cube filename is "\\scrsvr\Users\<MyName>\Documents\Projects\Raw data\data.cub".
I tried the following code:
library(X4R)
handle<-xmlaConnect(url="\\scrsvr\\Users\\<MyName>\\Documents\\Projects\\Raw data\\data.cub")
which produces no error. However, trying to use xmlaDiscover results in the following error:
first argument is not an open XMLA handle
Can anyone help me to connect to my local cube? I am totally new to this, so perhaps someone has a working example that I could follow (with example data cube file)?
Thanks!
OK, just in case anyone else needs to do this in future: I managed to solve my problem using the RDCOMClient package.
# use some necessary packages
require(RDCOMClient)
require(data.table)
library(plyr)
library(Hmisc)
#create connection to SSAS datacube
con <- COMCreate("ADODB.Connection")
con[["ConnectionString"]] <- paste(
"Provider=MSOLAP.5",
"Data Source=\\\\scrsvr\\Users\\<MyName>\\Documents\\Projects\\Raw data\\data.cub",
"Persist Security Info=True",
sep = ";")
con$Open()
# define the MDX query here:
query = "SELECT ..."
rs <- COMCreate("ADODB.RecordSet")
# submit the MDX query to the cube
rs$Open(query, con)
rs$MoveFirst() # move to the first row of the record set
nc <- rs$Fields()$Count() # define number of columns
# get the data into a data array:
dd <- rs$GetRows() # get the raw data from the result set
For the connection, I used the Provider ("MSOLAP.5") that was configured when I set up a data connection from MS Excel to the data cube. It all works and I can now query the database using MDX directly.
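If you then want the result as a regular data frame, something like the following should work (a sketch only: it assumes GetRows() comes back as a matrix with one column per record, and it pulls the column names through the ADO Fields collection, which is zero-indexed; the Item()/Name() calls are standard ADO but untested here):
# Convert the raw record-set output to a data frame (assumptions noted above).
df <- as.data.frame(t(dd), stringsAsFactors = FALSE)
names(df) <- sapply(seq_len(nc) - 1, function(i) rs$Fields()$Item(i)$Name())
# Close the record set and the connection when done.
rs$Close()
con$Close()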

Load package data with a separate name

I want to load data sets from different packages but assign them to separate objects. Some packages have data sets with the same name. I want to load them, but as separate objects. For example:
data("milk", package = "EMSC")
data("milk", package = "baseline")
But the latter will overwrite the former. So, I want to assign them to objects such as milk.emsc and milk.baseline.
Is there an efficient and simple solution for this?
Since I came back to this question after a long time, I will write down the answer I came up with, in case someone has the same problem.
local({
data("milk", package="baseline", envir=environment())
assign(x="milk_baseline", envir=.GlobalEnv, value=milk)
})
local({
data("milk", package="EMSC", envir=environment())
assign(x="milk_emsc", envir=.GlobalEnv, value=milk)
})
This way the global environment stays clean and ends up containing only the two data sets, originally with the same name, from the two different packages.
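The same idea can be wrapped in a small helper function (a sketch; get_data is just an illustrative name), which loads a data set into a throwaway environment and returns it, so nothing extra lands in the global environment:
get_data <- function(name, package) {
  e <- new.env()
  data(list = name, package = package, envir = e)
  e[[name]]
}
milk_emsc     <- get_data("milk", "EMSC")
milk_baseline <- get_data("milk", "baseline")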

Package a large data set

Column-wise storage in the inst/extdata directory of a package, as suggested by Jan, is now implemented in the dfunbind package.
I'm using the data-raw idiom to make entire analyses from the raw data to the results reproducible. For this, datasets are first wrapped in R packages which can then be loaded with library().
One of the datasets I'm using is largish, around 8 million observations with about 80 attributes. For my current analysis I only need a small fraction of the attributes, but I'd like to package the entire dataset anyway.
Now, if it is simply packaged as a data frame (e.g., with devtools::use_data()), it will be loaded in its entirety when first accessing it. What would be the best approach to package this kind of data so that I can lazy-load at the column level? (Only those columns which I'm actually accessing are loaded, the others happily stay on disk and don't occupy RAM.) Would the ff package help? Can anyone point me to a working example?
I think I would store the data in inst/extdata. Then create a couple of functions in your package that can read and return parts of that data. In your functions you can get the path to your data using system.file("extdata", "yourfile", package = "yourpackage") (as on the page you linked to).
The question then is what format you store your data in and how you obtain selections from it without reading all of the data into memory. For that, there are a large number of options. To name some:
sqlite: store your data in an SQLite database. You can then perform queries on this data using the RSQLite package (see the short sketch after this list).
ff: store your data in ff objects (e.g. save using the save.ffdf function from ffbase; use load.ffdf to load again). ff doesn't handle character fields well (they are always converted to factors). In theory the files are not cross-platform, although as long as you stay on Intel platforms you should be OK.
CSV: store your data in a plain old CSV file. You can then make selections from this file using the LaF package. The performance will probably be lower than with ff, but might be good enough.
RDS: store each of your columns in a separate RDS file (using saveRDS) and load them using readRDS. The advantage is that you do not depend on any additional R packages. This is fast. The disadvantage is that you cannot do row selections (but that does not seem to be needed in your case).
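For the SQLite option mentioned above, a minimal sketch looks like this (the file and table names are just illustrative; it uses the DBI/RSQLite interface):
# Write the data once into an SQLite file shipped with the package.
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "inst/extdata/mydata.sqlite")
dbWriteTable(con, "observations", iris)
# Later, read only the columns you need:
part <- dbGetQuery(con, 'SELECT "Sepal.Width" FROM observations')
dbDisconnect(con)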
If you only want to select columns, I would go with RDS.
A rough example using RDS
The following code creates an example package containing the iris data set:
load_data <- function(dataset, columns) {
result <- vector("list", length(columns));
for (i in seq_along(columns)) {
col <- columns[i]
fn <- system.file("extdata", dataset, paste0(col, ".RDS"), package = "lazyload")
result[[i]] <- readRDS(fn)
}
names(result) <- columns
as.data.frame(result)
}
store_data <- function(package, name, data) {
dir <- file.path(package, "inst", "extdata", name)
dir.create(dir, recursive = TRUE)
for (col in names(data)) {
saveRDS(data[[col]], file.path(dir, paste0(col, ".RDS")))
}
}
packagename <- "lazyload"
package.skeleton(packagename, "load_data")
store_data(packagename, "iris", iris)
After building and installing the package (you'll need to fix the documentation generated by package.skeleton, e.g. delete it), you can do:
library(lazyload)
data <- load_data("iris", "Sepal.Width")
To load the Sepal.Width column of the iris data set.
Of course this is a very simple implementation of load_data: there is no error handling, it assumes all requested columns exist, and it does not know which columns or data sets exist.
