Create, validate and compare data schemas (for data harvest) in ckanr

This is a data harvest exercise. The code below ingests data as JSON. I would like to (a) print/extract the data schema from this data into Schema1, and then (b) compare it against some Schema2, because I need to change some metadata header names and accepted values.
CKAN has a Python "IDatasetForm" plugin that apparently allows this kind of analysis, but I don't know how to do this in R. Thanks
library(ckanr)
ckanr_setup(url = "https://energydata.info")
data <- ckanr::package_show('6c0d331b-4dca-4815-a006-264745cbb9d0', as = 'json', pretty = TRUE)
data <- jsonlite::fromJSON(data)$result
## extract Schema1 from 'data'
## compare Schema1 vs. Schema2
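One hedged way to fill in those two comments: treat the metadata field names as the "schema" and compare two such schemas with setdiff(). The mock data list below stands in for the parsed package_show() result, and Schema2 is a hypothetical target schema, not anything defined by CKAN.

```r
# Placeholder for the parsed API result from above
data <- list(name = "dataset", title = "Example",
             resources = list(list(format = "CSV")))

schema1 <- names(data)                 # (a) top-level metadata headers
schema2 <- c("name", "title", "notes") # hypothetical target schema

setdiff(schema1, schema2)              # (b) headers only in Schema1: "resources"
setdiff(schema2, schema1)              # headers missing from Schema1: "notes"
```

For nested resources, the same names()/setdiff() pattern can be applied per resource element.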


How to save vegan::simper() output to data table

I'm trying to save vegan::simper() output as a data frame so that I can filter objects and eventually export a table for publication. However, the simper output is of class 'list' and I'm not sure how to convert it to a data frame. Here is some sample code using the dune data.
library(vegan)
# Species and environmental data
dune <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.spe.txt', row.names = 1)
dune.env <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.env.txt', row.names = 1)
data(dune)
data(dune.env)
(sim <- with(dune.env, simper(dune, Management)))
summary(sim)
class(sim)
To complement the comment by @dcarlson: the simper result object is a complicated beast and there is no easy way of getting a table, especially as I don't know what kind of table you are looking for. The result object stores every pair of factor classes in a separate table. You can extract all those tables with summary(sim). If you want to see only one table, for instance for the pair SF_HF, use
summary(sim)$SF_HF
(To see the available names, use names(sim).) Then it is up to you to collect the table you want from these individual tables. All the information is there.
And read the warnings in the manual page.
If you want to get something similar as the short printed output, look at vegan:::print.simper to see how it can be done.
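One hedged way to turn all the pairwise tables into a single exportable data frame: bind them together with a column recording which pair each row came from. The tabs list below is a small mock standing in for summary(sim), which (per the vegan docs) is a list with one contribution table per factor-level pair; column names here are illustrative.

```r
# Mock of summary(sim): one data frame per pair of factor levels
tabs <- list(
  SF_HF = data.frame(average = c(0.10, 0.05), cumsum = c(0.10, 0.15),
                     row.names = c("Poatriv", "Agrostol")),
  SF_NM = data.frame(average = c(0.08, 0.07), cumsum = c(0.08, 0.15),
                     row.names = c("Elymrep", "Poatriv"))
)

# Flatten: keep species (row names) and the pair label as columns
flat <- do.call(rbind, lapply(names(tabs), function(p) {
  d <- tabs[[p]]
  d$species <- rownames(d)
  d$pair <- p
  d
}))
flat
```

With a real simper result, replace the mock with tabs <- summary(sim); the resulting flat data frame can then be filtered and exported as usual.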

R - [DESeq2] - Making DESeq Dataset object from csv of already normalized counts

I'm trying to use DESeq2's plotPCA function in a meta-analysis of data.
Most of the files I have received are raw counts, pre-normalization. I'm running DESeq2 to normalize them, then calling plotPCA.
One of the files I received does not have raw counts or even the FASTQ files, just the data that has already been normalized by DESeq2.
How could I go about importing this data (non-integers) as a DESeqDataSet object after it has already been normalized?
Consensus in vignettes and other comments seems to be that objects can only be constructed from matrices of integers.
I was mostly concerned with getting the format the same between plots. Ultimately, I just used a workaround to get the plots looking the same via ggfortify.
If anyone is curious, I just ended up doing this. Note, the "names" file is organized like the meta file for colData when building a DESeq object with DESeqDataSetFromMatrix, but I changed the name of the design column from "conditions" to "group" so it would match the output of plotPCA. Should look identical.
library(ggfortify)
# normalized counts: genes in rows, samples in columns
data <- read.csv("COUNTS.csv", header = TRUE, row.names = 1)
# sample metadata, analogous to colData for DESeqDataSetFromMatrix
names <- read.csv("NAMES.csv")
PCA <- prcomp(t(data))
autoplot(PCA, data = names, colour = "group", size = 3)

How to retrieve data using the rentrez package by giving a list of query names instead of a single one?

So I'm trying to use the rentrez package to retrieve DNA sequence data from GenBank, giving as input a list of species.
What I've done is create a vector of the species I want to query, then build a term specifying the types of sequence data I want to retrieve, then run a search that returns all occurrences matching my query, and finally fetch the actual sequence data as FASTA into data.
library(rentrez)
species<-c("Ablennes hians","Centrophryne spinulosa","Doratonotus megalepis","Entomacrodus cadenati","Katsuwonus pelamis","Lutjanus fulgens","Pagellus erythrinus")
for (x in species){
  term <- paste0(x, "[Organism] AND (((COI[Gene] OR CO1[Gene] OR COXI[Gene] OR COX1[Gene]) AND (500[SLEN]:3000[SLEN])) OR complete genome[All Fields] OR mitochondrial genome[All Fields])")
  search <- entrez_search(db = "nuccore", term = term, retmax = 99999)
  data <- entrez_fetch(db = "nuccore", id = search$ids, rettype = "fasta")
}
Basically what I'm trying to do is concatenate the query results for each species into a single variable. I started with a for loop, but in this form it makes no sense, because the data for each new species simply overwrites the previous one in data.
For some elements of species, there will be no data to retrieve and R shows this error:
Error: Vector of IDs to send to NCBI is empty, perhaps entrez_search or entrez_link found no hits?
In the cases where this error is shown and therefore there is no data for that particular species, I wanted the code to just keep going and ignore that.
My desired output is a variable data containing the sequence data retrieved for all the names in species.
library(rentrez)
species<-c("Ablennes hians","Centrophryne spinulosa","Doratonotus megalepis","Entomacrodus cadenati","Katsuwonus pelamis","Lutjanus fulgens","Pagellus erythrinus")
data <- list()
for (x in species){
  term <- paste0(x, "[Organism] AND (((COI[Gene] OR CO1[Gene] OR COXI[Gene] OR COX1[Gene]) AND (500[SLEN]:3000[SLEN])) OR complete genome[All Fields] OR mitochondrial genome[All Fields])")
  search <- entrez_search(db = "nuccore", term = term, retmax = 99999)
  # tryCatch() keeps the loop going when a species returns no hits,
  # storing NA for that species instead of raising an error
  data[[x]] <- tryCatch(entrez_fetch(db = "nuccore", id = search$ids, rettype = "fasta"),
                        error = function(e) NA)
}
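Once the loop has run, the per-species results can be combined into a single string, dropping the species where the fetch failed. The placeholder strings below stand in for real entrez_fetch() FASTA output, so this sketch runs without a network connection.

```r
# Placeholder results standing in for the list built by the loop above;
# NA marks a species for which entrez_search() found no hits
data <- list("Ablennes hians"     = ">seq1\nACGT\n",
             "Katsuwonus pelamis" = NA,
             "Pagellus erythrinus" = ">seq2\nTTGA\n")

# Keep only successful fetches, then concatenate into one FASTA string
valid <- !vapply(data, function(x) all(is.na(x)), logical(1))
all_fasta <- paste(unlist(data[valid]), collapse = "")
cat(all_fasta)
```

Writing all_fasta to disk with writeLines() then gives a single FASTA file covering every species that returned hits.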

extract tree-like data into list in R

I would like to read JSON data from the PubChem API on paracetamol and extract the ChEBI Ontology information (section 18.1.2) stored therein.
I.e.: I want to get all the entries for each role (i.e. application, biological role and chemical role) in a list structure in R.
For this I get the data via the API and convert it into an R object (chebi). So far so good.
require(httr)
require(jsonlite)
require(data.tree)
# from JSON to R list
qurl = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/classification/JSON?classification_type=simple'
cid = 1983
post = POST(qurl, body = list("cid" = paste(cid, collapse = ',')))
cont_json = try(content(post, type = 'text', encoding = 'UTF-8'), silent = FALSE)
cont = fromJSON(cont_json, simplifyDataFrame = FALSE)
# subset list (i.e. get CHEBI data)
cont_l = cont$Hierarchies$Hierarchy
idx = which(sapply(cont_l, function(x) x$SourceName == 'ChEBI'))
chebi = cont_l[[idx]]
Then from the chebi object I want to retrieve the information which entries each role (i.e. application, biological role, chemical role) contains.
(1) My first idea was to simply extract the Name information. However, I then lose the tree-like structure of the data and don't know what belongs to which role.
ch_node = chebi$Node
sapply(ch_node, function(x) x$Information$Name)
(2) Secondly I saw that there's the data.tree package. However I don't know how to convert the chebi object properly.
chebi_tree = as.Node(ch_node) #?
Question: How can I get the role information from the chebi object into a list in R without losing the tree-like structure?
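One hedged way to keep the structure is to group node names by their parent with split(). The mock ch_node below stands in for chebi$Node; the field names used here (NodeID, ParentID, Information$Name) are assumptions based on the PubChem classification JSON layout and should be checked against str(chebi$Node) on the real data.

```r
# Minimal mock of the Node structure (real data comes from the API call above)
ch_node <- list(
  list(NodeID = "n1", ParentID = "root", Information = list(Name = "application")),
  list(NodeID = "n2", ParentID = "n1",   Information = list(Name = "analgesic")),
  list(NodeID = "n3", ParentID = "root", Information = list(Name = "biological role")),
  list(NodeID = "n4", ParentID = "n3",   Information = list(Name = "xenobiotic"))
)

# Group entry names by parent node, preserving who belongs to which role
parents    <- vapply(ch_node, function(x) x$ParentID, character(1))
node_names <- vapply(ch_node, function(x) x$Information$Name, character(1))
grouped    <- split(node_names, parents)
grouped
```

For a fully nested result, the same parents/node_names vectors can also be fed to data.tree::FromDataFrameNetwork() after pairing them into an edge data frame.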

How to access data frames in list by name

I'm pulling large sets of data for multiple sites and views from Google Analytics for processing in R. To streamline the process, I've added my most common queries to a function (so I only have to pass the profile ID and date range). Each query is stored as a local variable in the function, and is assigned to a dynamically-named global variable:
R version 3.1.1
library(rga)
library(stargazer)
# I would add a dataset, but my data is locked down by client agreements and I don't currently have any test sites configured.
profiles <- ga$getProfiles()
website1 <- profiles[1,]
start <- "2013-01-01"
end <- "2013-12-31"
# profiles are objects containing all the ID's, accounts #, etc.; start and end specify date range as strings (e.g. "2014-01-01")
reporting <- function(profile, start, end){
  id <- profile[, 1] # sets profile number from profile object
  # rga function for building and submitting query to API
  general <- ga$getData(id,
                        start.date = start,
                        end.date = end,
                        metrics = "ga:sessions")
  ... # additional queries, structured similarly to the example above (e.g. countries, cities, etc.)
  # transforms name of profile object to string
  profileName <- deparse(substitute(profile))
  # appends "Data" to profile object name
  temp <- paste(profileName, "Data", sep = "")
  # stores query results as list
  temp2 <- list(general, countries, cities, devices, sources, keywords, pages, events)
  # assigns list of query results and stores it globally
  assign(temp, temp2, envir = .GlobalEnv)
}
#call reporting function at head of report or relevant section
reporting(website1,start,end)
#returns list of data frames returned by the ga$getData(...), but within the list they are named "data.frame" instead of their original query name.
#generate simple summary table with stargazer package for display within the report
stargazer(website1Data[1])
I'm able to access these results through website1Data[1], but I'm handing the data off to collaborators. Ideally, they should be able to access the data by name (e.g. website1Data$countries).
Is there an easier/better way to store these results, and to make accessing them easier from within an .Rmd report?
There's no real reason to do the deparse inside the function just to assign a variable in the parent environment. If you have to call the reporting() function, just have it return a value and assign the result:
reporting <- function(profile, start, end){
  # ... all the other code
  # return results
  list(general = general, countries = countries, cities = cities,
       devices = devices, sources = sources, keywords = keywords,
       pages = pages, events = events)
}
#store results
websiteResults <- reporting(website1,start,end)
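With named list elements, each query result can then be pulled out by name rather than by position. The small data frames below are placeholders for the real ga$getData() results, just to show the access pattern collaborators would use in an .Rmd report.

```r
# Placeholder for the list returned by reporting() above
websiteResults <- list(
  general   = data.frame(sessions = 120),
  countries = data.frame(country = c("US", "DE"), sessions = c(80, 40))
)

websiteResults$countries   # access by name instead of websiteResults[[2]]
```

The same named access works directly inside stargazer(), e.g. stargazer(websiteResults$countries).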
