Split Neo4j Cypher query into smaller queries - R

So I'm trying to extract some data from my Neo4j database to a file using R.
This is what the code looks like:
library('bitops')
library('RCurl')
library('RJSONIO')

query <- function(querystring) {
  h <- basicTextGatherer()
  curlPerform(url = "localhost:7474/db/data/cypher",
              postfields = paste('query', curlEscape(querystring), sep = '='),
              writefunction = h$update,
              verbose = FALSE
  )
  result <- fromJSON(h$value())
  #print(result)
  data <- data.frame(t(sapply(result$data, unlist)))
  print(data)
  names(data) <- result$columns
  data
}
q <-"MATCH (n:`layer_1_SB`)-[r]-> (m) WHERE m:layer_1_SB RETURN n.userid, m.userid LIMIT 18000000"
data <- query(q)
head(data)
dim(data)
names(data)
write.table(data, file = "/home/dataminer/data1.dat", append = FALSE, quote = FALSE, sep = " ", eol = "\n", na = "NA", dec = ".", row.names = FALSE)
And it works fine, returning around 147k relationships. However, when I run the same query between two different labels (layer_1 to layer_2), which should return around 18 million relationships, the program loads for a while and then returns NULL. When I run the same query and return only the count in the Neo4j browser, it works, so I'm assuming the problem has to do with the amount of data that R can handle.
The question is: how can I split my query into smaller queries so that my code works?
UPDATE
I tried doing a query with 10 million rels and it worked. So now I want to use WITH and ORDER BY to return the first and then the last relationships. However, it's returning NULL; I believe my query is badly formatted:
MATCH (n:'layer_1_SB')-[r]-> (m) WITH n ORDER BY n.userid DESC WHERE m:layer_2_SB RETURN n.userid, m.userid LIMIT 8000000

You should use the transactional endpoint instead, or at least pass the header X-Stream: true.
Both stream data from the server, so the result doesn't build up in the server's memory.
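For example, a minimal sketch of the header approach, reusing the query() function above: just add httpheader to the existing curlPerform call, everything else stays the same.

curlPerform(url = "localhost:7474/db/data/cypher",
            postfields = paste('query', curlEscape(querystring), sep = '='),
            httpheader = c('X-Stream' = 'true'),
            writefunction = h$update,
            verbose = FALSE
)

The transactional endpoint is a POST to /db/data/transaction/commit with a JSON body of the form {"statements":[{"statement":"MATCH ... RETURN ..."}]}. It streams by default, but its response JSON is shaped differently, so the fromJSON parsing in query() would need to be adjusted accordingly.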

Related

wildcards to download particular csv

Still very new to R, so please excuse me.
I am trying to download CSV data from the Sloan Digital Sky Survey (SDSS). Within R I do the following -
astro1 <- read.csv("https://dr14.sdss.org/optical/spectrum/view/data/format=csv/spec=full?mjd=55359&fiberid=596&plateid=4055")
This downloads one CSV spectrum per fibre ID per plate [here, plateid=4055]. However, if there are several hundred fibre IDs it will be a very long couple of days.
Is there a way to batch download all CSV data for all fibre IDs? I tried fiberid=* (and "", " ", #), but got the following errors -
"no lines available in input", or an "unexpected string constant" error.
If, for example, there are 100 .csv files per plate, all will have a common x-axis (wavelength) but a different third column (best fit, for the y-axis). Is there a way to combine the downloaded CSV tables into one very large dataset with that common wavelength axis, with the subsequent columns showing only the Best Fit columns?
Many thx
The best case would be that you have a list of all the links to the CSV files you want. Since that is seemingly not the case, you know you want to loop over all the fibre IDs. You know the structure of the link, so we can use it to define
buildFibreIdLink <- function(fibreId) {
  paste0("https://dr14.sdss.org/optical/spectrum/view/data/format=csv/spec=full?mjd=55359&fiberid=", fibreId, "&plateid=4055")
}
Now I would just loop over all ids, whatever "all" means in this case: just start at 1 and count up. For that I would use the function
getCsvDataList <- function(startId = 1, endId = 10, maxConsecutiveNulls = 5) {
  dataList <- list()
  consecutiveNullCount <- 0
  for (id in startId:endId) {
    csvLink <- buildFibreIdLink(fibreId = id)
    newData <- tryCatch(expr = {
      read.csv(csvLink)
    }, error = function(e) { return(NULL) })
    if (is.null(newData)) {
      consecutiveNullCount <- consecutiveNullCount + 1
    } else {
      dataList <- c(dataList, list(newData))
      consecutiveNullCount <- 0
    }
    if (consecutiveNullCount == maxConsecutiveNulls) {
      print(paste0("reached maxConsecutiveNulls at id ", id))
      break
    }
  }
  return(dataList)
}
Specify the id-range you want to read, so that you really can read the CSVs in parts. Now the question is: when have you reached the end? My answer would basically be: you have reached the end when there are maxConsecutiveNulls consecutive failed reads. I assume that a link doesn't exist if you can't read it, hence the tryCatch block triggers, and I simply count these triggers up to a given maximum.
If you know that the structure of the csvs is always the same, you can merge the list of data.frames together via
dataListFrom1to10 <- getCsvDataList(startId = 1, endId = 10)
merged1to10 <- do.call("rbind", dataListFrom1to10)
Update: If you already have a vector of the fibre IDs you need, you can modify the function as follows. Since we didn't know the exact IDs, we looped from 1 onwards. Now, knowing the IDs, you can replace the startId and endId arguments by, say, fibreIdVector, to get the signature
getCsvDataList <- function(fibreIdVector, maxConsecutiveNulls). In the for-loop, replace for(id in startId:endId) by for(id in fibreIdVector) (a sketch of this modified version follows after the lapply example below). If you know that all your IDs are valid, you can remove the error handling to get a much cleaner function. Since you then don't need to carry state between iterations, e.g. the consecutiveNullCount, you can just put everything into an lapply like
allCsvData <- lapply(fibreIdVector, function(id) {
  read.csv(buildFibreIdLink(fibreId = id))
})
replacing the whole function.
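For completeness, here is a sketch of the modified loop version described in the update, keeping the error handling (fibreIdVector stands for whatever vector of ids you actually have):

getCsvDataList <- function(fibreIdVector, maxConsecutiveNulls = 5) {
  dataList <- list()
  consecutiveNullCount <- 0
  for (id in fibreIdVector) {
    # try to read the csv for this id; NULL signals a failed download
    newData <- tryCatch(read.csv(buildFibreIdLink(fibreId = id)),
                        error = function(e) NULL)
    if (is.null(newData)) {
      consecutiveNullCount <- consecutiveNullCount + 1
    } else {
      dataList <- c(dataList, list(newData))
      consecutiveNullCount <- 0
    }
    # stop early after too many consecutive failures
    if (consecutiveNullCount == maxConsecutiveNulls) {
      print(paste0("reached maxConsecutiveNulls at id ", id))
      break
    }
  }
  dataList
}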

R: how to query from a database multiple times based on different dates

I have an R Markdown file in which I establish a database connection, query data, and store the data in a csv file. The query is based on a specific date range. How can I automate making multiple queries, one after another, so that e.g. every week is queried from the database? I cannot make one query for e.g. the whole year; I need to store the data separately for each week. I could make a data frame with two columns for the start and end date of each week, which I would like to use for the query.
But how can I automatically run the queries multiple times depending on the date data frame?
My code so far:
# load libraries
library(RPostgreSQL)
library(sf)   # for st_write()

drv <- PostgreSQL()
db_con <- dbConnect(drv, host=my_host, user=my_user, dbname=my_name, port=my_port, password=my_password)

start = "2015-01-01"
end = "2015-01-02"

result <- dbGetQuery(
  db_con,
  "SELECT * FROM table WHERE date >= start AND date <= end;")

st_write(result, pathname)
Consider parameterization using DBI::sqlInterpolate with a Map (wrapper to mapply) iteration:
db_con <- dbConnect(
  PostgreSQL(), host=my_host, user=my_user, dbname=my_name,
  port=my_port, password=my_password
)
# ALL WEEKLY DATES IN 2015
dates_df <- data.frame(
  start = seq.Date(as.Date("2015-01-01"), as.Date("2016-01-01"), by="week"),
  end   = seq.Date(as.Date("2015-01-08"), as.Date("2016-01-08"), by="week")
)

# USER DEFINED METHOD TO QUERY AND WRITE DATA
query_db <- function(s, e) {
  # PREPARED STATEMENT WITH PLACEHOLDERS
  sql <- "SELECT * FROM table WHERE date >= ?start AND date <= ?end;"

  # BIND PARAMETERS AND QUERY
  stmt <- DBI::sqlInterpolate(db_con, sql, start=s, end=e)
  result <- dbGetQuery(db_con, stmt)

  # WRITE DATA TO DISK
  st_write(result, pathname)

  # RETURN QUERY RESULTSET
  return(result)
}

# WRITE AND STORE DATA IN MEMORY
df_list <- Map(query_db, dates_df$start, dates_df$end)
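If you also want the in-memory list keyed by week (a small optional addition, not part of the original answer), you can name the list elements by their start dates:

names(df_list) <- as.character(dates_df$start)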
I suggest DBI::dbBind and a frame (similar to #Parfait's answer).
For demonstration, I have a "sessions" table on my pg instance that has a field ScheduledStart. In this case, it's a TIMESTAMPTZ column, not a Date column, so I need to take one more step in my demo (convert from R's Date to POSIXt classes).
# pg <- DBI::dbConnect(...)
ranges <- data.frame(
  start = seq(as.Date("2020-03-01"), length.out = 4, by = "week")
)
ranges$end <- ranges$start + 6
# this line is only necessary because of my local table
ranges[] <- lapply(ranges, as.POSIXct)
Here is the bulk of the "query multiple weeks":
res <- DBI::dbSendQuery(pg, "select count(*) as n from Sessions where ScheduledStart between ? and ?")
DBI::dbBind(res, ranges)
out <- DBI::dbFetch(res)
DBI::dbClearResult(res)
out
# n
# 1 8
# 2 1
# 3 0
# 4 0
While sqlInterpolate is much better than forming your own query strings (e.g., with sprintf or paste), using dbBind allows for internal iteration like above, and allows the DBMS to optimize the query with the binding parameter ? instead of actual data. (Using sqlInterpolate, the DBMS would see four different queries. Using dbBind, it sees one query, optimizes it, and uses it four times.)
That query was really boring (select * ... works, too), but I think it gets the point across. The only downside of this method is that, while it makes it really easy to get all of the data, nothing here inherently tells you which of your queries a particular row came from. I suspect you can determine that from your data, and that your main intent in breaking it down by week is the amount retrieved per query.
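If you do need that provenance, one option (a sketch that goes beyond the original answer; it assumes the pg connection and the ranges frame from above) is to bind one week at a time against the same prepared statement and tag each chunk before combining:

res <- DBI::dbSendQuery(pg, "select * from Sessions where ScheduledStart between ? and ?")
chunks <- lapply(seq_len(nrow(ranges)), function(i) {
  DBI::dbBind(res, list(ranges$start[i], ranges$end[i]))   # rebind for this week
  chunk <- DBI::dbFetch(res)
  chunk$week_start <- rep(ranges$start[i], nrow(chunk))    # tag rows with their week
  chunk
})
DBI::dbClearResult(res)
out <- do.call(rbind, chunks)

This trades a little of the single-bind elegance for an explicit week_start column on every row.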
Side note: I often use code like this in functions, where it is feasible that something between dbSendQuery and dbClearResult might interrupt operation. In that case, I tend to reorder my code a little, like this:
somefunc <- function(...) {
  # ...
  res <- DBI::dbSendQuery(pg, "select count(*) as n from Sessions where ScheduledStart between ? and ?")
  on.exit({
    suppressWarnings(DBI::dbClearResult(res))
  }, add = TRUE)
  DBI::dbBind(res, ranges)
  out <- DBI::dbFetch(res)
  DBI::dbClearResult(res)
  return(out)
}

Pass strings from a column into a function by using a loop - R

I have a dataset with ~10,000 species. For each species in the dataset I want to query the IUCN database for the threats facing it. I can do this one species at a time using the rl_threats function from the package rredlist. Below is an example: it pulls the threats facing Fratercula arctica and assigns them to the object test1 (key is a string that serves as a password for the IUCN API and stays constant; parse should be TRUE, but that is less important).
test1 <- rl_threats(name = "Fratercula arctica",
                    key = '1234',
                    parse = TRUE)
I want to get threats for all 10,000 species in my dataset. My idea is to use a loop that passes in the names from my dataset into the name=" " field in the rl_threats command. This is a basic loop I tried to construct to do this but I'm getting lots of errors:
for (i in 1:df$scientific_name) {
  rl_threats(name = i,
             key = '1234',
             parse = TRUE)
}
How would I pass the species names from the scientific_name column into the rl_threats function such that R would loop through and pull threats for every species?
Thank you.
You can create a list to store the output.
result <- vector('list', length(df$scientific_name))
names(result) <- df$scientific_name

for (i in df$scientific_name) {
  result[[i]] <- rl_threats(name = i, key = '1234', parse = TRUE)
}
You can also use lapply:
result <- lapply(df$scientific_name, function(x) rl_threats(name = x, key = '1234', parse = TRUE))
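With the lapply version, it can help to key the results by species so you know which threats belong to which name (a small optional addition, not part of the original answers):

result <- setNames(result, df$scientific_name)

You may also want to wrap the rl_threats call in tryCatch so a single failing species does not stop the whole run.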

How to force Hive to distribute data equally among different reducers?

Imagine I want to send the Iris dataset, which I have as a Hive table, to different reducers in order to run the same task in parallel in R. I can execute my R script through the transform function and use lateral view explode in Hive to do a cartesian product of the iris dataset and an array containing my "partition" variable, as in the query below:
set source_table = iris;
set x_column_names = "sepallenght|sepalwidth|petallength|petalwidth";
set y_column_name = "species";
set output_dir = "/r_output";
set model_name = "paralelism_test";
set param_var = params;
set param_array = array(1,2,3);
set mapreduce.job.reduces=3;

select transform(id, sepallenght, sepalwidth, petallength, petalwidth, species, ${hiveconf:param_var})
using 'controlScript script.R ${hiveconf:x_column_names} ${hiveconf:y_column_name} ${hiveconf:output_dir} ${hiveconf:model_name} ${hiveconf:param_var}'
as (script_result string)
from (
  select *
  from ${hiveconf:source_table}
  lateral view explode( ${hiveconf:param_array} ) temp_table as ${hiveconf:param_var}
  distribute by ${hiveconf:param_var}
) data_query;
I call a memory control script (controlScript); please ignore it so we can stay focused on the question.
What my script.R returns is the list of unique parameter values it received (the "params" column populated from the param_var array) and the number of rows in the partition it was given, as follows:
# The aim of this script is to validate the parallel computation of R scripts through Hive.

compute_model <- function(data) {
  paste("parameter ", unique(data[ncol(data)]), ", ", nrow(data), "lines")
}

main <- function(args) {
  # Reading the input parameters.
  # These inputs were passed along the transform's "using" clause, on Hive.
  x_column_names <- as.character(unlist(strsplit(gsub(' ', '', args[1]), '\\|')))
  y_column_name  <- as.character(args[2])
  target_dir     <- as.character(args[3])
  model_name     <- as.character(args[4])
  param_var_name <- as.character(args[5])

  # Reading the data table from stdin
  f <- file("stdin")
  open(f)
  data <- tryCatch({
    as.data.frame(
      read.table(f, header=FALSE, sep='\t', stringsAsFactors = T, dec='.')
    )},
    warning = function(w) cat(w),
    error = function(e) stop(e),
    finally = close(f)
  )

  # Computes the model. Here, the model can be any computation.
  instance_result <- as.character(compute_model(data))

  # Writes the result to "stdout" separated by '\t'. This output must be a data frame where
  # each column represents a Hive table column.
  write.table(instance_result,
              quote = FALSE,
              row.names = FALSE,
              col.names = FALSE,
              sep = "\t",
              dec = '.'
  )
}

# Main code
###############################################################
main(commandArgs(trailingOnly=TRUE))
What I want Hive to do is distribute the Iris dataset equally among these reducers. It works fine when I put sequential values in my param_array variable, but for values like array(10, 100, 1000, 10000) with mapreduce.job.reduces=4, or array(-5,-4,-3,-2,-1,0,1,2,3,4,5) with mapreduce.job.reduces=11, some reducers won't receive any data and others will receive more than one key.
The question is: is there a way to make sure hive distributes each partition to a different reducer?
Did I make myself clear?
It may look silly to do it this way, but I want to run grid search on Hadoop, and I have restrictions on using other technologies that would be more suitable for this task.
Thank you!

rmongodb is very slow in creating data.frame

I am using MongoDB to do tick data analysis in R. Initially I used MySQL, which worked fine, but I wanted to test MongoDB for this purpose. The data set contains about 200 million entries at the moment. Using RODBC I could get the query result into a data.frame very quickly using sqlQuery(conn, "select * from td where prd = 'TY' and date = '2012-01-03'")
In MongoDB I have documents like Document{{_id=5537ca647a3ad42a84374f0a, prd=TY, time=1325661600043, px=130.6875, sz=11}}
In Java I can retrieve a day's worth of tick data - roughly 100,000 entries - create Tick objects and add them to an array, all in less than 2 seconds.
Using rmongodb, the below takes forever. Any ideas how to improve this?
query <- mongo.bson.from.list( list(product = "TY", date = as.POSIXct("2012-01-04")) )
res.cursor <- mongo.find(mongo, db.coll, query, limit = 100e3, options=mongo.find.exhaust)
resdf <- mongo.cursor.to.data.frame(res.cursor)
Using find.all is equally slow.
