The popular graph database Neo4j can be used within R thanks to the package/driver RNeo4j (https://github.com/nicolewhite/Rneo4j).
The package author, Nicole White, provides several great examples of its usage on GitHub.
Unfortunately for me, those examples and the documentation are a bit oversimplified, in that they manually create each graph node with its associated labels and properties, such as:
mugshots = createNode(graph, "Bar", name = "Mugshots", location = "Downtown")
parlor = createNode(graph, "Bar", name = "The Parlor", location = "Hyde Park")
nicole = createNode(graph, name = "Nicole", status = "Student")
addLabel(nicole, "Person")
That's all well and good when you're dealing with a tiny example dataset, but this approach isn't feasible for something like a large social graph with thousands of users, where each user is a node (such graphs might not use every node in every query, but the nodes still need to be loaded into Neo4j).
I'm trying to figure out how to do this using vectors or data frames. Is there a solution, perhaps involving an apply function or a for loop?
This basic attempt:
for (i in 1:length(df$user_id)) {
  paste(df$user_id[i]) = createNode(graph, "user", name = df$name[i], email = df$email[i])
}
Leads to Error: 400 Bad Request
As a first attempt, you should look at the functionality I just added for the transactional endpoint:
http://nicolewhite.github.io/RNeo4j/docs/transactions.html
library(RNeo4j)
graph = startGraph("http://localhost:7474/db/data/")
clear(graph)
data = data.frame(Origin = c("SFO", "AUS", "MCI"),
                  FlightNum = c(1, 2, 3),
                  Destination = c("PDX", "MCI", "LGA"))
query = "
MERGE (origin:Airport {name:{origin_name}})
MERGE (destination:Airport {name:{dest_name}})
CREATE (origin)<-[:ORIGIN]-(:Flight {number:{flight_num}})-[:DESTINATION]->(destination)
"
t = newTransaction(graph)
for (i in 1:nrow(data)) {
  origin_name = data[i, ]$Origin
  dest_name = data[i, ]$Destination  # the column is named Destination, not Dest
  flight_num = data[i, ]$FlightNum

  appendCypher(t,
               query,
               origin_name = origin_name,
               dest_name = dest_name,
               flight_num = flight_num)
}
commit(t)
cypher(graph, "MATCH (o:Airport)<-[:ORIGIN]-(f:Flight)-[:DESTINATION]->(d:Airport)
RETURN o.name, f.number, d.name")
Here, I form a Cypher query, then loop through the data frame and pass its values as parameters to that query. Your current approach is slow because you're sending a separate HTTP request for each node created; with the transactional endpoint, you create several things under a single transaction. If your data frame is very large, I would split it up into roughly 1000 rows per transaction, as sketched below.
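A rough sketch of that chunking, reusing the graph, query, and data objects from above (the 1000-row threshold is only the rule of thumb just mentioned):
t = newTransaction(graph)
for (i in 1:nrow(data)) {
  appendCypher(t, query,
               origin_name = data[i, ]$Origin,
               dest_name = data[i, ]$Destination,
               flight_num = data[i, ]$FlightNum)
  # commit every 1000 rows and open a fresh transaction
  if (i %% 1000 == 0) {
    commit(t)
    t = newTransaction(graph)
  }
}
commit(t)  # commit the remainder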
As a second attempt, you should consider using LOAD CSV in the neo4j-shell.
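A hedged sketch of what that could look like: LOAD CSV can also be sent from R through cypher(), assuming a hypothetical flights.csv (headers Origin,FlightNum,Destination) that the Neo4j server can read, and noting that file-URL handling differs between Neo4j versions:
load_csv = "
LOAD CSV WITH HEADERS FROM 'file:///flights.csv' AS row
MERGE (origin:Airport {name: row.Origin})
MERGE (destination:Airport {name: row.Destination})
CREATE (origin)<-[:ORIGIN]-(:Flight {number: toInt(row.FlightNum)})-[:DESTINATION]->(destination)
"
cypher(graph, load_csv)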
Related
I am trying to download sequence data from 1283 records in GenBank using rentrez. I'm using the following code: first to search for records fitting my criteria, then to link across databases, and finally to fetch the sequence data:
# Search for sequence ids in the biosample database
search <- entrez_search(db = "biosample",
                        term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]",
                        retmax = 9999, use_history = T)
search$ids
length(search$ids)
search$web_history

# Link IDs across databases: biosample to nuccore (nucleotide sequences)
nuc_links <- entrez_link(dbfrom = "biosample",
                         id = search$web_history,
                         db = "nuccore",
                         by_id = T)
nuc_links$links

# Fetch nucleotide sequences
fetch_ids1 <- entrez_fetch(db = "nucleotide",
                           id = nuc_links$links$biosample_nuccore,
                           rettype = "xml")
When I do this for one single record, I am able to get the data I need. When I try to scale it up to pull data for all the sequences using the web history of my search, it doesn't work: the nuc_links$links list is NULL, which tells me that entrez_link is not working the way I hoped. Can anyone show me where I'm going wrong?
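One thing worth checking, hedged on rentrez's documented signature: entrez_link() takes the history object through its own web_history argument rather than through id =, and by_id applies to vectors of individual ids, so something like this may behave better:
# pass the history via web_history (rentrez's dedicated argument for it);
# by_id is dropped because it applies to vectors of individual ids
nuc_links <- entrez_link(dbfrom = "biosample",
                         web_history = search$web_history,
                         db = "nuccore")
nuc_links$links$biosample_nuccore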
For species niche modeling I am trying to fetch building heights from the brilliant 3D BAG data of TU Delft (https://3dbag.nl/nl/download). I want to do this for the city of Haarlem. It is possible to select and download tiles manually, but this is quite labor-intensive and error-prone (a missing tile), and I want to repeat the process for more cities. So I am trying to use the WFS service to download the features. I created a bounding box of Haarlem with a 1.2 extent for the WFS request. However, the maximum number of records the server delivers is 5000, and despite many alternative attempts I have failed so far to get past that limit. This is partly caused by my confusion about the WFS semantics: when I check with GetCapabilities it is hard to work out the namespace, featureTypes, and individual attributes (or properties). What I have tried:
Add pagination. But all the tutorials I have read so far need the actual/maximum number of features in addition to the server maximum (resultType = "hits"), and I was not able to easily retrieve this total for the limits of the bounding box (see the sketch after this list).
Select tiles. I figured it should be possible to extract the tile ids that match the bounding box, using tile_id, an attribute of the BAG3D_v2:bag_tiles_3k layer, and then build an apply or loop to extract the features per tile. But I already failed to create a cql_filter that selects an individual tile.
Create tiles. Since I am not entirely sure whether the individual tiles from the 3D BAG service already exceed the 5000-feature limit, an alternative approach could be to split the bounding box into many small tiles using the R package slippymath, and then extract the features per tile. But then the challenge of filtering remains the same.
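For the pagination route, here is a hedged sketch of the resultType = "hits" probe; the numberMatched attribute name comes from the WFS 2.0 spec and should be verified against this server's actual response:
library(httr)
library(xml2)

# ask the server how many features match the bbox, without returning data
url <- parse_url("https://data.3dbag.nl/api/BAG3D_v2/wfs")
url$query <- list(service = "WFS",
                  version = "2.0.0",
                  request = "GetFeature",
                  typename = "BAG3D_v2:lod22",
                  bbox = "100768.4,482708.5,107923.1,494670.4",
                  resultType = "hits")
hits <- read_xml(build_url(url))
as.numeric(xml_attr(hits, "numberMatched"))  # total feature count (assumed attr name)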
Any help with this would be appreciated. The basic code I used in many different ways:
library(httr)
library(sf)    # for st_read()
library(tmap)  # for qtm()

url <- parse_url("https://data.3dbag.nl/api/BAG3D_v2/wfs")
url$query <- list(service = "WFS",
                  version = "2.0.0",
                  request = "GetFeature",
                  typename = "BAG3D_v2:lod22",
                  #cql_filter = "BAG3D_v2:tile_id ='4199'",
                  bbox = "100768.4,482708.5,107923.1,494670.4",
                  startindex = 10000,
                  sortBy = "gid")
request <- build_url(url)
test <- st_read(request)
qtm(test)
One solution is to loop over startindex in steps of 5000, stopping when the returned shape contains fewer than 5000 features, which means you're done (unless the total number of features is an exact multiple of 5000...).
Below is a piece of code adapted from the happign package.
library(httr)
library(sf)

# function for building the WFS request url
build_3DBAG_url <- function(startindex){
  url <- parse_url("https://data.3dbag.nl/api/BAG3D_v2/wfs")
  url$query <- list(service = "WFS",
                    version = "2.0.0",
                    request = "GetFeature",
                    typename = "BAG3D_v2:lod22",
                    #cql_filter = "BAG3D_v2:tile_id ='4199'",
                    bbox = "100768.4,482708.5,107923.1,494670.4",
                    startindex = startindex,
                    count = 5000,
                    sortBy = "gid")
  url <- build_url(url)
  return(url)
}

# initialize with the first request
resp <- read_sf(build_3DBAG_url(startindex = 0))
message("Features downloaded : ", nrow(resp), appendLF = F)

# loop until the returned shape has fewer than 5000 features
i <- 5000
temp <- resp
while(nrow(temp) == 5000){
  message("...", appendLF = F)
  temp <- read_sf(build_3DBAG_url(startindex = i))
  resp <- rbind(resp, temp)
  message(nrow(resp), appendLF = F)
  i <- i + 5000
}
I have a dataset with ~10,000 species. For each species in the dataset I want to query the IUCN database for the threats facing it. I can do this one species at a time using the rl_threats function from the rredlist package. Below is an example of the function; it pulls the threats facing Fratercula arctica and assigns them to the object test1 (key is a string that serves as a password for the IUCN API and stays constant; parse should be TRUE but is less important).
test1 <- rl_threats(name = "Fratercula arctica",
                    key = '1234',
                    parse = TRUE)
I want to get threats for all 10,000 species in my dataset. My idea is to use a loop that passes the names from my dataset into the name argument of rl_threats. This is a basic loop I tried to construct, but I'm getting lots of errors:
for (i in 1:df$scientific_name) {
  rl_threats(name = i,
             key = '1234',
             parse = TRUE)
}
How would I pass the species names from the scientific_name column into the rl_threats function such that R would loop through and pull threats for every species?
Thank you.
You can create a named list to store the output; naming the list up front lets result[[i]] fill the preallocated slots instead of appending new named elements at the end.
result <- setNames(vector('list', length(df$scientific_name)),
                   df$scientific_name)

for (i in df$scientific_name) {
  result[[i]] <- rl_threats(name = i, key = '1234', parse = TRUE)
}
You can also use lapply:
result <- lapply(df$scientific_name, function(x) rl_threats(name = x, key = '1234', parse = TRUE))
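With ~10,000 names it may also help to trap per-species failures and pace the requests. A hedged sketch (the 0.5 second pause is an arbitrary courtesy delay, not a documented IUCN limit):
result <- setNames(vector('list', length(df$scientific_name)),
                   df$scientific_name)
for (nm in df$scientific_name) {
  result[[nm]] <- tryCatch(
    rl_threats(name = nm, key = '1234', parse = TRUE),
    error = function(e) NULL  # record the failure as NULL and keep going
  )
  Sys.sleep(0.5)  # arbitrary pause between API calls
}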
I would like to extract a specific set of product SKUs from Google Analytics with several metrics included. The SKUs that I would like to extract are in a list. I cannot seem to get Analytics to do what I need it to do.
I have been trying to work out how to filter on a list. The most common answer I am able to find on how to use dim_filter is this page:
https://www.rdocumentation.org/packages/googleAnalyticsR/versions/0.7.0/topics/dim_filter
I have tried multiple ways to get my answer and keep getting errors at many different parts of the code.
stDate <- "2019-07-18"
endDate <- "2019-09-30"

x <- list(BC$sku)

# Get all of the info from Analytics for products
b <- google_analytics(ga_id,
                      date_range = c(stDate, endDate),
                      metrics = c("itemQuantity", "itemRevenue", "productDetailViews"),
                      dimensions = c("productSku"),
                      dim_filter = x,
                      anti_sample = TRUE)
The above code gives me the following error:
Error in as(dim_filters, ".filter_clauses_ga4") :
no method or default for coercing “list” to “.filter_clauses_ga4”
I am not able to get any output from this code because the filter is not working.
I can, of course, query the entire dataset, but that becomes cumbersome very fast; I would like to be able to query the Google Analytics API with a specific set of SKUs anytime.
You need to construct your filter object further to handle the list of SKUs. As specified in the official googleAnalyticsR documentation, dim_filters are created with the dim_filter() function and filter_clause_ga4().
You can send in a list of dim_filter() objects, or try the "IN_LIST" operator and send in your character vector (if it's not too big).
In the latter case the final code would look something like:
stDate <- "2019-07-18"
endDate <- "2019-09-30"

# if the SKU list is small enough
# (dimension name matches the "productSku" used in dimensions below)
dim_filters <- list(dim_filter("productSku", "IN_LIST", BC$sku))

# Get all of the info from Analytics for products
b <- google_analytics(ga_id,
                      date_range = c(stDate, endDate),
                      metrics = c("itemQuantity", "itemRevenue", "productDetailViews"),
                      dimensions = c("productSku"),
                      dim_filters = filter_clause_ga4(dim_filters),
                      anti_sample = TRUE)
# this may work as well to construct the filter
dim_filters <- lapply(BC$sku,
                      function(x) {
                        dim_filter("productSku", operator = "EXACT", expressions = x)
                      })
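If you construct one EXACT filter per SKU that way, the resulting list presumably still needs to be combined with an OR so that a row matching any SKU passes; a hedged one-liner based on the filter_clause_ga4() signature:
# combine the per-SKU EXACT filters with OR (any match passes)
sku_filter_clause <- filter_clause_ga4(dim_filters, operator = "OR")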
I would like to read JSON data from the PubChem API on Paracetamol and extract the 18.1.2 ChEBI Ontology information that is stored therein (see screenshot).
That is, I want to get all the entries for each role (i.e. application, biological role and chemical role) in a list structure in R.
For this I get the data via the API and convert it into an R object (chebi). So far so good.
require(httr)
require(jsonlite)
require(data.tree)
# from JSON to R list
qurl = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/classification/JSON?classification_type=simple'
cid = 1983
post = POST(qurl, body = list("cid" = paste(cid, collapse = ',')))
cont_json = try(content(post, type = 'text', encoding = 'UTF-8'), silent = FALSE)
cont = fromJSON(cont_json, simplifyDataFrame = FALSE)
# subset list (i.e. get CHEBI data)
cont_l = cont$Hierarchies$Hierarchy
idx = which(sapply(cont_l, function(x) x$SourceName == 'ChEBI'))
chebi = cont_l[[idx]]
Then, from the chebi object, I want to retrieve which entries each role (i.e. application, biological role, chemical role) contains.
(1) My first idea was to simply extract the Name information. However, I then lose the tree-like structure of the data and don't know what belongs to which role.
ch_node = chebi$Node
sapply(ch_node, function(x) x$Information$Name)
(2) Secondly, I saw that there's the data.tree package. However, I don't know how to convert the chebi object properly.
chebi_tree = as.Node(ch_node) #?
Question: How can I get the role information from the chebi object into a list in R without losing the tree-like structure?
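Not a full answer, but a hedged sketch of approach (2), assuming each element of chebi$Node carries NodeID and ParentID fields (as PubChem classification JSON usually does; verify with str(ch_node[[1]])):
# Rebuild the hierarchy as a parent/child edge list for data.tree;
# nodes without a ParentID hang off a synthetic root
edges <- do.call(rbind, lapply(ch_node, function(x) {
  data.frame(from = if (is.null(x$ParentID)) "chebi_root" else x$ParentID[[1]],
             to = x$NodeID,
             name = x$Information$Name,
             stringsAsFactors = FALSE)
}))
chebi_tree <- FromDataFrameNetwork(edges)  # first two columns = parent, child
print(chebi_tree, "name")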