Neo.DatabaseError.General.UnknownError: GC overhead limit exceeded in R 10.12.1

I'm totally new to Neo4j. I was loading a CSV file when this error occurred. How can I fix it? Thanks so much!
library("RNeo4j")
library("curl")
graph <- startGraph("http://localhost:7474/db/data", username = "neo4j", password = "")
clear(graph, input = F)
query <- "LOAD CSV WITH HEADERS FROM {csv} AS row CREATE (n:flights {year: row.year, month: row.mo, dep_time: row.dep_time, arr_time: row.arr_time, carrier: row.carrier, tailnum: row.tailnum, flight: row.flight, origin: row.origin, dest: row.dest, air_time: row.air_time, distance: row.distance, hour: row.hour, minute: row.minute })
cypher(graph, query, csv = "file:///flights1/flights.csv")
Error: Client error: (400) Bad Request
Neo.DatabaseError.General.UnknownError
GC overhead limit exceeded
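"GC overhead limit exceeded" usually means the server is trying to build the whole import in a single transaction. A common mitigation (a sketch, not verified against this exact setup) is to batch the load with USING PERIODIC COMMIT and/or raise the Java heap in the Neo4j configuration; the property list below is shortened, so keep the full list from the original query:

library(RNeo4j)

graph <- startGraph("http://localhost:7474/db/data", username = "neo4j", password = "")

# Commit every 1000 rows instead of holding the whole CSV in one transaction.
query <- "
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM {csv} AS row
CREATE (n:flights {year: row.year, month: row.mo, carrier: row.carrier,
                   origin: row.origin, dest: row.dest, distance: row.distance})
"
cypher(graph, query, csv = "file:///flights1/flights.csv")

If the driver refuses to run PERIODIC COMMIT inside an open transaction, splitting the CSV into smaller files (see the last question in this list) is an alternative.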

Related

Count bytes with InfluxDB's Telegraf

I can receive messages with the inputs.mqtt_consumer Telegraf plugin, but it writes a lot of data to InfluxDB.
How can I configure Telegraf to just count the number of received bytes and messages and report those counts to InfluxDB?
# Configuration for telegraf agent
[agent]
  interval = "20s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false

[[outputs.influxdb_v2]]
  urls = ["XXXXXXXXXXXXXXXX"]
  token = "$INFLUX_TOKEN"
  organization = "XXXXXXXXXXXXXXX"
  bucket = "XXXXXXXXXXXXXXX"

[[inputs.mqtt_consumer]]
  servers = ["tcp://XXXXXXXXXXXXXXXXXXXXX:1883"]
  topics = [
    "#",
  ]
  data_format = "value"
  data_type = "string"
I tried to Google around but didn't find any clear way to do it.
I just want the number of bytes and messages received each minute for the selected topic.
I did not manage to receive all the messages and count them, but I found a solution where I can get the statistics from the broker instead. Not exactly what I asked for, but fine for what I need.
topics = [
"$SYS/broker/load/messages/received/1min",
"$SYS/broker/load/messages/sent/1min",
]
...
data_format = "value"
data_type = "float"

I'm trying to get some tweets with academictwitteR, but the code fails with an endpoint_url error

I'm trying to get some tweets with academictwitteR, but the code throws the following error:
tweets_espn <- get_all_tweets(query = "fluminense",
                              user = "ESPNBrasil",
                              start_tweets = "2020-01-01T00: 00: 00Z ",
                              end_tweets = "2020-31-12T00 : 00: 00Z ",
                              n = 10000)
query: fluminense (from:ESPNBrasil)
Error in make_query(url = endpoint_url, params = params, bearer_token = bearer_token, :
  something went wrong. Status code: 403
In addition: Warning messages:
1: Recommended to specify a data path in order to mitigate data loss when ingesting large amounts of data.
2: Tweets will not be stored as JSONs or as a .rds file and will only be available in local memory if assigned to an object.
It seems to me that you can only access the Twitter API via academictwitteR if you have been granted "Academic Research" access on the Twitter developer portal, so I don't think it works with Essential or Elevated access.
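Two smaller problems in the call above are also worth fixing regardless of access level (this is an observation about the code shown, not part of the original answer): the timestamps contain stray spaces, and the end date is written day-before-month (2020-31-12). Assuming Academic Research access and a bearer token configured via set_bearer(), the call would look roughly like:

library(academictwitteR)

tweets_espn <- get_all_tweets(
  query        = "fluminense",
  user         = "ESPNBrasil",           # same argument name as in the original call
  start_tweets = "2020-01-01T00:00:00Z", # ISO 8601, no spaces
  end_tweets   = "2020-12-31T00:00:00Z", # year-month-day order
  n            = 10000,
  data_path    = "data/"                 # also silences the data-loss warning
)

Even with those fixed, a 403 still points to the access-level issue described above.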

Airflow: connect to SQL Server and load SELECT results into a data frame

I am trying to connect to a local SQL Server instance to get data from a table and process it with pandas, but I can't figure out how to pass the SELECT query results into a data frame.
The code below works to clear data in the table:
```
sql_command = """ DELETE FROM [TestDB].[dbo].[PythonTestData] """
t3 = MsSqlOperator(task_id='run_test_proc',
                   mssql_conn_id='mssql_local',
                   sql=sql_command,
                   dag=dag,
                   database='TestDB',
                   autocommit=True)
```
The intended pandas code is:
```
import pandas as pd
from datetime import datetime
from pathlib import Path

# conn is assumed to be an open connection to the SQL Server database
query = 'SELECT * FROM [ClientData]'  # where product_name='''+i+''''''
df = pd.read_sql(query, conn)
pn_list = df['ClientID'].tolist()
# print("The original pn_list is : " + str(pn_list))
for i in pn_list:
    varw = str(i)
    queryw = 'SELECT * FROM [ClientData] where [ClientID]=' + varw
    dfw = pd.read_sql(queryw, conn)
    dfw = dfw.applymap(str)
    cols = ['product_id', 'product_name', 'brand_id']
    x = dfw.values.tolist()
    x = x[0]
    ClientID = x[0]
    Name = x[1]
    Org = x[2]
    Email = x[3]
    # print('Name :'+Name+' ,'+'Org :'+Org+' ,'+'Email :'+Email+' ,'+'ClientID :'+ClientID)
    salesData_qry = 'SELECT * FROM [TestDB].[dbo].[SalesData] where [ClientID]=' + ClientID
    salesData_df = pd.read_sql(salesData_qry, conn)
    salesData_df['year1'] = salesData_df['Order Date'].dt.strftime('%Y')
    salesData_df['OrderMonth'] = salesData_df['Order Date'].dt.strftime('%b')
    filename = 'Daily_Campaign_Report_' + Name + '_' + Org + '_' + datetime.now().strftime("%Y%m%d_%H%M%S")
    p = Path('C:/Users/user/Documents/WorkingData/')
    salesData_df.to_csv(Path(p, filename + '.csv'))
```
Please point me to the correct approach, as I'm new to Airflow.
I'm not clear on how you generate the query code, but to get a DataFrame from MS SQL you need to use MsSqlHook:
from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook
from airflow.operators.python import PythonOperator

def mssql_func(**kwargs):
    hook = MsSqlHook(conn_id='mssql_local')
    df = hook.get_pandas_df(sql="YOUR_QUERY")
    # do whatever you need on the df

run_this = PythonOperator(
    task_id='mssql_task',
    python_callable=mssql_func,
    dag=dag
)
This is the code I am using for the DAG:
def mssql_func(**kwargs):
    conn = MsSqlHook.get_connection(conn_id="mssql_local")
    hook = conn.get_hook()
    df = hook.get_pandas_df(sql="SELECT * FROM [TestDB].[dbo].[ClientData]")
    # do whatever you need on the df
    print(df)

run_this = PythonOperator(
    task_id='mssql_task',
    python_callable=mssql_func,
    dag=dag
)
Error Log
[2021-01-12 16:07:15,114] {providers_manager.py:159} WARNING - The provider for package 'apache-airflow-providers-imap' could not be registered from because providers for that package name have already been registered
[2021-01-12 16:07:15,618] {base.py:65} INFO - Using connection to: id: mssql_local. Host: localhost, Port: 1433, Schema: dbo, Login: sa, Password: XXXXXXXX, extra: None
[2021-01-12 16:07:15,626] {taskinstance.py:1396} ERROR - (18456, b"Login failed for user 'sa'.DB-Lib error message 20018, severity 14:\nGeneral SQL Server error: Check messages from the SQL Server\nDB-Lib error message 20002, severity 9:\nAdaptive Server connection failed (localhost)\nDB-Lib error message 20002, severity 9:\nAdaptive Server connection failed (localhost)\n")
Traceback (most recent call last):
File "src/pymssql.pyx", line 636, in pymssql.connect
File "src/_mssql.pyx", line 1964, in _mssql.connect
File "src/_mssql.pyx", line 682, in _mssql.MSSQLConnection.__init__
File "src/_mssql.pyx", line 1690, in _mssql.maybe_raise_MSSQLDatabaseException
_mssql.MSSQLDatabaseException: (18456, b"Login failed for user 'sa'.DB-Lib error message 20018, severity 14:\nGeneral SQL Server error: Check messages from the SQL Server\nDB-Lib error message 20002, severity 9:\nAdaptive Server connection failed (localhost)\nDB-Lib error message 20002, severity 9:\nAdaptive Server connection failed (localhost)\n")
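For what it's worth, the traceback is a plain SQL Server login failure for the sa user, so the hook code itself is being reached and the mssql_local connection credentials are the likely culprit (an assumption based on the log above, not something confirmed in the thread). The connection can be recreated from the Airflow CLI, for example:

# Hypothetical values: substitute the real password, port and database
airflow connections delete 'mssql_local'
airflow connections add 'mssql_local' \
    --conn-uri 'mssql://sa:YourPassword@localhost:1433/TestDB'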

rentrez entrez_summary premature EOF

Trying to move on from my troubles with RISmed (see Problems with RISmed and large(ish) data sets), I decided to use rentrez and entrez_summary to retrieve a large list of pubmed titles from a query:
set_entrez_key("######") #I did provide my real API key here
Sys.getenv("ENTREZ_KEY")
rm(list=ls())
library(rentrez)
query="(United States[AD] AND France[AD] AND 1995:2020[PDAT])"
results<-entrez_search(db="pubmed",term=query,use_history=TRUE)
results
results$web_history
for (seq_start in seq(0, results$count, 100)) {
  if (seq_start == 0) {
    summary.append.l <- entrez_summary(
      db = "pubmed",
      web_history = results$web_history,
      retmax = 100,
      retstart = seq_start
    )
  }
  Sys.sleep(0.1)  # slow things down in case THAT'S a factor here....
  summary.append.l <- append(
    summary.append.l,
    entrez_summary(
      db = "pubmed",
      web_history = results$web_history,
      retmax = 100,
      retstart = seq_start
    )
  )
}
The good news: I didn't get a flat-out rejection from NCBI like I did with RISmed and EUtilsGet. The bad news: it's not completing. I get either
Error in curl::curl_fetch_memory(url, handle = handle) :
transfer closed with outstanding read data remaining
or
Error: parse error: premature EOF
(right here) ------^
I almost think there's something about using an affiliation search string in the query, because if I change the query to
query="monoclonal[Title] AND antibody[Title] AND 2010:2020[PDAT]"
it completes the run, despite having about the same number of records to deal with. So...any ideas why a particular search string would result in problems with the NCBI servers?
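No answer is recorded for this one, but since both failures are transient transfer errors, one common mitigation (a sketch reusing the search and web_history objects from the question; fetch_batch is a made-up helper name) is to retry each batch a few times with a back-off:

library(rentrez)

# Hypothetical helper: retry one 100-record batch up to 'tries' times.
fetch_batch <- function(web_history, seq_start, tries = 3) {
  for (attempt in seq_len(tries)) {
    res <- tryCatch(
      entrez_summary(db = "pubmed", web_history = web_history,
                     retmax = 100, retstart = seq_start),
      error = function(e) NULL
    )
    if (!is.null(res)) return(res)
    Sys.sleep(2 * attempt)  # back off before retrying
  }
  stop("Batch starting at ", seq_start, " failed after ", tries, " tries")
}

summaries <- list()
for (seq_start in seq(0, results$count, 100)) {
  summaries <- append(summaries, fetch_batch(results$web_history, seq_start))
  Sys.sleep(0.1)
}

This also avoids appending the first batch twice, which the original loop does when seq_start is 0.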

How can I cut large csv files using any R packages like ff or data.table?

I want to split large CSV files (file size larger than RAM) and either use the pieces directly or save each one to disk for later use. Which R package is best for doing this with large files?
I haven't tried it, but using the skip and nrows parameters of read.table or read.csv is worth a try. From ?read.table:
skip integer: the number of lines of the data file to skip before
beginning to read data.
nrows integer: the maximum number of rows to read in. Negative and
other invalid values are ignored.
To avoid trouble at the end of the file you need to do some error handling; in other words, I don't know what happens when the skip value is greater than the number of rows in your big CSV.
P.S. I also don't know whether header=TRUE interacts with skip; you'll have to check that too.
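A minimal sketch of that idea (file name and chunk size are made up; because only the first read sees the header, the column names are captured once and then re-applied with col.names):

chunk_size <- 100000        # rows per chunk (arbitrary)
infile     <- "big.csv"     # hypothetical file name

header <- names(read.csv(infile, nrows = 1))
skip   <- 0
part   <- 1
repeat {
  chunk <- tryCatch(
    read.csv(infile, skip = skip + 1, nrows = chunk_size,
             header = FALSE, col.names = header),
    error = function(e) NULL    # skip ran past the end of the file
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  write.csv(chunk, paste0("part_", part, ".csv"), row.names = FALSE)
  if (nrow(chunk) < chunk_size) break    # last (short) chunk
  skip <- skip + chunk_size
  part <- part + 1
}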
The answer given by @berkorbay is OK, and I can confirm that header can be used with skip. However, if your file is really large this gets painfully slow, since each subsequent read after the first must skip over all previously read lines.
I had to do something similar and, after wasting quite a bit of time, I wrote a short script in Perl which fragments the original file into chunks that you can read one after the other. It is much faster. I enclose the source here, translating some parts so that the intent is clear:
#!/usr/bin/perl
system("cls");

print("Fragment .csv file keeping header in each chunk\n");
print("\nEnter input file name = ");
$entrada = <STDIN>;
print("\nEnter maximum number of lines in each fragment = ");
$nlineas = <STDIN>;
print("\nEnter output file name stem = ");
$salida = <STDIN>;
chop($salida);

open(IN, $entrada) || die "Cannot open input file: $!\n";
$cabecera = <IN>;
$leidas = 0;
$fragmento = 1;
$fichero = $salida.$fragmento;
open(OUT, ">$fichero") || die "Cannot open output file: $!\n";
print OUT $cabecera;
while (<IN>) {
    if ($leidas > $nlineas) {
        close(OUT);
        $fragmento++;
        $fichero = $salida.$fragmento;
        open(OUT, ">$fichero") || die "Cannot open output file: $!\n";
        print OUT $cabecera;
        $leidas = 0;
    }
    $leidas++;
    print OUT $_;
}
close(OUT);
Just save it under whatever name and execute it. The first line might have to be changed if you have Perl in a different place (and, if you are on Windows, you might have to invoke the script as "perl name-of-script").
One should use read.csv.ffdf from the ff package with specific parameters like these to read the big file:
library(ff)
a <- read.csv.ffdf(file="big.csv", header=TRUE, VERBOSE=TRUE, first.rows=1000000, next.rows=1000000, colClasses=NA)
Once the big file is read into an ff object, it can be subset into data frames with, for example:
a[1000:1000000,]
The rest of the code, for subsetting and saving the chunked data frames:
totalrows = dim(a)[1]
row.size = as.integer(object.size(a[1:10000,])) / 10000  # in bytes
block.size = 200000000  # in bytes, i.e. 200 MB
# rows.block is rows per block
rows.block = ceiling(block.size / row.size)
# nmaps is the number of chunks/maps of the big data frame (ff); nmaps = number of maps - 1
nmaps = floor(totalrows / rows.block)
for (i in 0:nmaps) {
  if (i == nmaps) {
    df = a[(i*rows.block + 1):totalrows, ]
  } else {
    df = a[(i*rows.block + 1):((i+1)*rows.block), ]
  }
  # process df or save it
  write.csv(df, paste0("M", i+1, ".csv"))
  # remove df
  rm(df)
}
Alternatively, you can first read the file into MySQL using dbWriteTable and then use the read.dbi.ffdf function from the ETLUtils package to read it back into R. Consider the function below:
library(RMySQL)
library(ETLUtils)

read.csv.sql.ffdf <- function(file, name, overwrite = TRUE, header = TRUE,
                              drv = MySQL(), dbname = "new", username = "root",
                              host = "localhost", password = "1234") {
  conn = dbConnect(drv, user = username, password = password, host = host, dbname = dbname)
  dbWriteTable(conn, name, file, header = header, overwrite = overwrite)
  on.exit(dbRemoveTable(conn, name))
  command = paste0("select * from ", name)
  ret = read.dbi.ffdf(command, dbConnect.args = list(drv = drv, dbname = dbname,
                                                     username = username, password = password))
  return(ret)
}
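Usage would then look something like this (the file and table names are placeholders, and the MySQL credentials are the defaults hard-coded in the function above):

# Push big.csv through MySQL and get an ffdf object back
big_ffdf <- read.csv.sql.ffdf(file = "big.csv", name = "big_table")
dim(big_ffdf)
big_ffdf[1:10, ]   # subsets like the ff object 'a' above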
