R - How would I create a structure to hold data

I am using R, and I have some genes that each carry metadata fields such as name, type, and other data.
I want to create an object or dictionary that stores this data in fields, so that I can look up a gene by key and then access its fields. How can I do this in R?
Here is an example:
gene{
name = "geneName",
type = "some type",
...
...
}
I would like to access a gene like this:
gene <- gene("geneName")
type <- gene.type
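One way to get this behaviour in base R is a named list of lists, where the outer names act as keys and the inner names act as fields. This is only a minimal sketch: the gene names and field values below are placeholders, not real data.

```r
# Hypothetical gene records; the names and field values are placeholders
genes <- list(
  geneName  = list(name = "geneName",  type = "some type"),
  otherGene = list(name = "otherGene", type = "another type")
)

# Look up a gene by key, then access a field with $
gene <- genes[["geneName"]]
type <- gene$type

# Add a new gene under its own key
genes[["thirdGene"]] <- list(name = "thirdGene", type = "some type")

# Pull one field out of every gene at once
types <- vapply(genes, function(g) g$type, character(1))
```

If there are many keys, an environment created with new.env() can play the same role as the outer list.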

Related

How to retrieve data using the rentrez package by giving a list of query names instead of a single one?

I'm trying to use the rentrez package to retrieve DNA sequence data from GenBank, giving a list of species as input.
I create a vector of the species I want to query, build a term that specifies the types of sequence data to retrieve, run a search that returns all records matching the query, and finally fetch the actual sequence data in FASTA format into data.
library(rentrez)
species<-c("Ablennes hians","Centrophryne spinulosa","Doratonotus megalepis","Entomacrodus cadenati","Katsuwonus pelamis","Lutjanus fulgens","Pagellus erythrinus")
for (x in species){
term<-paste(x,"[Organism] AND (((COI[Gene] OR CO1[Gene] OR COXI[Gene] OR COX1[Gene]) AND (500[SLEN]:3000[SLEN])) OR complete genome[All Fields] OR mitochondrial genome[All Fields])",sep='',collapse = NULL)
search<-entrez_search(db="nuccore",term=term,retmax=99999)
data<-entrez_fetch(db="nuccore",id=search$ids,rettype="fasta")
}
Basically, I'm trying to concatenate the query results for each species into a single variable. I started with a for loop, but it makes no sense in this form because the data for each newly queried species just replaces the previous one in data.
For some elements of species, there will be no data to retrieve, and R shows this error:
Error: Vector of IDs to send to NCBI is empty, perhaps entrez_search or entrez_link found no hits?
When this error is shown because there is no data for that particular species, I want the code to ignore it and keep going.
My output would be a variable data containing the sequence data retrieved for all the names in species.
library(rentrez)
species<-c("Ablennes hians","Centrophryne spinulosa","Doratonotus megalepis","Entomacrodus cadenati","Katsuwonus pelamis","Lutjanus fulgens","Pagellus erythrinus")
# Collect the results in a list keyed by species name
data <- list()
for (x in species){
term<-paste(x,"[Organism] AND (((COI[Gene] OR CO1[Gene] OR COXI[Gene] OR COX1[Gene]) AND (500[SLEN]:3000[SLEN])) OR complete genome[All Fields] OR mitochondrial genome[All Fields])",sep='',collapse = NULL)
search<-entrez_search(db="nuccore",term=term,retmax=99999)
# If the fetch fails (e.g. the search found no IDs), store NA for that species and keep going
data[x] <- tryCatch({entrez_fetch(db="nuccore",id=search$ids,rettype="fasta")},
error = function(e){NA})
}
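Once the loop has filled the list, the NA placeholders can be dropped and the remaining FASTA strings joined into one variable. The sketch below uses a small made-up list in place of the real rentrez output, so the species keys and sequences are purely illustrative.

```r
# Made-up stand-in for the list built in the loop above:
# one FASTA string per species, NA where the fetch failed
data <- list(
  "Species A" = ">seq1\nACGT\n",
  "Species B" = NA,
  "Species C" = ">seq2\nTTGA\n"
)

# Keep only the species for which something was retrieved
found <- data[!is.na(data)]

# Join the remaining FASTA strings into a single character value
all_fasta <- paste(unlist(found), collapse = "")

# Species that returned nothing, for reference
missing_species <- names(data)[is.na(data)]
```

The combined string could then be written to a file with writeLines(all_fasta, "sequences.fasta"), where the file name is just an example.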

R - accessing variable "name" inside a function

I have two or more file paths like the ones below, each to be read as a separate data frame:
file1 = ".data/abc_123.txt"
file2 = ".data/def_324.txt"
To enable batch reading, I store these file names in a vector:
filesVector = c(file1, file2)
Inside the function used to batch-read the files, I need to access the variable names that went into filesVector:
csvToDF = function(filesVector){
for(file in filesVector){
# is there a way to extract the variable names `file1` & `file2` in here, so as to create a data frame whose variable name includes the name of the file variable?
# in the example above, it should create two data frames stored as variables `df_file1` and `df_file2`
variable_name = read.csv(file)
}
}
The way you are doing it, the variable names get lost. As a workaround, you could name the vector elements before you call the function:
names(filesVector) <- c("file1", "file2")
Now you should be able to access the names inside the function simply with names(filesVector) or names(filesVector[1]).
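Building on the named vector, the function could return a named list of data frames instead of creating separate variables. This is a sketch under assumptions: it writes two small temporary files so that it runs on its own, and the df_ prefix simply follows the naming asked for in the question.

```r
# Create two small temporary CSV files so the sketch is self-contained
file1 <- tempfile(fileext = ".txt")
file2 <- tempfile(fileext = ".txt")
write.csv(data.frame(a = 1:2), file1, row.names = FALSE)
write.csv(data.frame(a = 3:5), file2, row.names = FALSE)

# Name the elements so the names travel with the paths
filesVector <- c(file1 = file1, file2 = file2)

csvToDF <- function(filesVector) {
  # lapply keeps the element names of the input vector
  dfs <- lapply(filesVector, read.csv)
  names(dfs) <- paste0("df_", names(filesVector))
  dfs
}

dfs <- csvToDF(filesVector)
```

The data frames are then available as dfs$df_file1 and dfs$df_file2.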

extract tree-like data into list in R

I would like to read JSON data on Paracetamol from the PubChem API and extract the 18.1.2 ChEBI Ontology information that is stored in it (see screenshot).
That is, I want to get all the entries for each role (i.e. application, biological role and chemical role) in a list structure in R.
For this I get the data via the API and convert it into an R object (chebi). So far so good.
require(httr)
require(jsonlite)
require(data.tree)
# from JSON to R list
qurl = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/classification/JSON?classification_type=simple'
cid = 1983
post = POST(qurl, body = list("cid" = paste(cid, collapse = ',')))
cont_json = try(content(post, type = 'text', encoding = 'UTF-8'), silent = FALSE)
cont = fromJSON(cont_json, simplifyDataFrame = FALSE)
# subset list (i.e. get CHEBI data)
cont_l = cont$Hierarchies$Hierarchy
idx = which(sapply(cont_l, function(x) x$SourceName == 'ChEBI'))
chebi = cont_l[[idx]]
Then, from the chebi object, I want to retrieve which entries each role (i.e. application, biological role, chemical role) contains.
(1) My first idea was to simply extract the Name information. However, I then lose the tree-like structure of the data and don't know what belongs to which role.
ch_node = chebi$Node
sapply(ch_node, function(x) x$Information$Name)
(2) Secondly, I saw that there's the data.tree package. However, I don't know how to convert the chebi object properly.
chebi_tree = as.Node(ch_node) #?
Question: How can I get the role information from the chebi object into a list in R without losing the tree-like structure?
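One possible base R approach is to rebuild the hierarchy from parent references. The sketch below is not tested against the real API response: it assumes each node carries a NodeID and a ParentID next to Information$Name, and it uses a small hand-made list with invented entries in place of chebi$Node.

```r
# Toy stand-in for chebi$Node; the NodeID/ParentID field names are assumptions
# and the entries are invented
ch_node <- list(
  list(NodeID = "n1", ParentID = "root", Information = list(Name = "application")),
  list(NodeID = "n2", ParentID = "root", Information = list(Name = "biological role")),
  list(NodeID = "n3", ParentID = "n1",   Information = list(Name = "drug")),
  list(NodeID = "n4", ParentID = "n2",   Information = list(Name = "inhibitor"))
)

ids     <- sapply(ch_node, function(x) x$NodeID)
parents <- sapply(ch_node, function(x) x$ParentID[1])
labels  <- sapply(ch_node, function(x) x$Information$Name)

# Recursively collect the children of a node as a nested named list
build_tree <- function(id) {
  kids <- which(parents == id)
  tree <- lapply(kids, function(k) build_tree(ids[k]))
  names(tree) <- labels[kids]
  tree
}

roles <- build_tree("root")
```

Here names(roles) gives the top-level roles, and names(roles$application) gives the entries below one of them.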

Create, validate and compare data schema (for data harvest) in ckanr

This is a data harvest exercise. The code below ingests data as JSON. I would like to (a) extract the data schema from this data into Schema1, and then (b) compare it against some Schema2, because I need to change some metadata header names and accepted values.
CKAN has a Python "IDatasetForm" plugin that apparently allows this kind of analysis, but I don't know how to do this in R. Thanks
library(ckanr)
ckanr_setup(url = "https://energydata.info")
data <- ckanr:::package_show('6c0d331b-4dca-4815-a006-264745cbb9d0', as = 'json', pretty = TRUE)
data <- jsonlite::fromJSON(data)$result
## extract Schema1 from 'data'
## compare Schema1 vs. Schema2
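As a rough starting point, a schema can be approximated by the field names and R classes of the parsed result, and two schemas can be compared with setdiff(). The sketch below uses a small made-up list in place of the real package_show() result, and the Schema2 field names are hypothetical.

```r
# Made-up stand-in for the parsed `data` object
data <- list(title = "x", license_id = "cc-by", tags = list("a", "b"))

# Schema1: one row per field, with the R class of its value
schema1 <- data.frame(
  field = names(data),
  type  = vapply(data, function(x) class(x)[1], character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Schema2: hypothetical target field names
schema2 <- c("title", "license", "tags")

# Fields required by Schema2 that the data lacks, and vice versa
missing_in_data <- setdiff(schema2, schema1$field)
extra_in_data   <- setdiff(schema1$field, schema2)
```

Checking accepted values would need an extra step, for example comparing each field's values against an allowed set with %in%.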

How to access data frames in list by name

I'm pulling large sets of data for multiple sites and views from Google Analytics for processing in R. To streamline the process, I've added my most common queries to a function (so I only have to pass the profile ID and date range). Each query is stored as a local variable in the function, and is assigned to a dynamically-named global variable:
R version 3.1.1
library(rga)
library(stargazer)
# I would add a dataset, but my data is locked down by client agreements and I don't currently have any test sites configured.
profiles <- ga$getProfiles()
website1 <- profiles[1,]
start <- "2013-01-01"
end <- "2013-12-31"
# profiles are objects containing all the ID's, accounts #, etc.; start and end specify date range as strings (e.g. "2014-01-01")
reporting <- function(profile, start, end){
id <- profile[,1] #sets profile number from profile object
#rga function for building and submitting query to API
general <- ga$getData(id,
start.date = start,
end.date = end,
metrics = "ga:sessions")
... #additional queries, structured similarly to example above(e.g. countries, cities, etc.)
#transforms name of profile object to string
profileName <- deparse(substitute(profile))
#appends "Data" to profile object name
temp <- paste(profileName, "Data", sep="")
#stores query results as list
temp2 <- list(general,countries,cities,devices,sources,keywords,pages,events)
#assigns list of query results and stores it globally
assign(temp, temp2, envir=.GlobalEnv)
}
#call reporting function at head of report or relevant section
reporting(website1,start,end)
#returns list of data frames returned by the ga$getData(...), but within the list they are named "data.frame" instead of their original query name.
#generate simple summary table with stargazer package for display within the report
stargazer(website1[1])
I'm able to access these results through website1Data[1], but I'm handing the data off to collaborators. Ideally, they should be able to access the data by name (e.g. website1Data$countries).
Is there an easier/better way to store these results, and to make accessing them easier from within an .Rmd report?
There's no real reason to do the deparse inside the function just to assign a variable in the parent environment. If you have to call the reporting() function anyway, just have that function return a named list and assign the result:
reporting <- function(profile, start, end){
#... all the other code
#return results
list(general=general,countries=countries,cities=cities,
devices=devices,sources=sources,keywords=keywords,
pages=pages,events=events)
}
#store results
websiteResults <- reporting(website1,start,end)
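With a named list returned, collaborators can reach each query result by name. The sketch below replaces the ga$getData() calls with small made-up data frames, so the function body and the numbers are placeholders only.

```r
# Placeholder version of reporting(): the real one would call ga$getData()
reporting <- function() {
  general   <- data.frame(sessions = 10)
  countries <- data.frame(country = c("US", "DE"), sessions = c(6, 4))
  # Naming the elements is what makes $ access possible later
  list(general = general, countries = countries)
}

websiteResults <- reporting()

# Access by name instead of by position
countries_df   <- websiteResults$countries
total_sessions <- websiteResults[["general"]]$sessions
```

names(websiteResults) lists the available result sets, which can be handy at the top of an .Rmd report.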
