Exception importing data into neo4j using batch-import - graph

I am running Neo4j 1.8.2 on a remote Unix box, using the batch-import jar from https://github.com/jexp/batch-import/downloads.
nodes.csv is the same as given in the example:
name age works_on
Michael 37 neo4j
Selina 14
Rana 6
Selma 4
rels.csv is like this:
start end type since counter:int
1 2 FATHER_OF 1998-07-10 1
1 3 FATHER_OF 2007-09-15 2
1 4 FATHER_OF 2008-05-03 3
3 4 SISTER_OF 2008-05-03 5
2 3 SISTER_OF 2007-09-15 7
But I am getting this exception:
Using Existing Configuration File
Total import time: 0 seconds
Exception in thread "main" java.util.NoSuchElementException
at java.util.StringTokenizer.nextToken(StringTokenizer.java:332)
at org.neo4j.batchimport.Importer$Data.split(Importer.java:156)
at org.neo4j.batchimport.Importer$Data.update(Importer.java:167)
at org.neo4j.batchimport.Importer.importNodes(Importer.java:226)
at org.neo4j.batchimport.Importer.main(Importer.java:83)
I am new to Neo4j and was checking whether this importer could save some coding effort.
It would be great if someone could point out the probable mistake.
Thanks for the help!
--Edit:--
My nodes.csv
name dob city state s_id balance desc mgr_primary mgr_secondary mgr_tertiary mgr_name mgr_status
John Von 8/11/1928 Denver CO 1114-010 7.5 RA 0023-0990 0100-0110 Doozman Keith Active
My rels.csv
start end type since status f_type f_num
2 1 address_of
1 3 has_account 5 Active
4 3 f_of Primary 0111-0230

Hi, I had some issues in the past with the batch-import script.
The formatting of your files must be very rigorous, which means:
no extra spaces where they are not expected, like the ones I see in the first line of your rels.csv before "start";
no multiple spaces in place of a tab. If your files are exactly like what you've copied here, you have 4 spaces instead of one tab, and that is not going to work, as the script uses a tokenizer that looks for tabs!
I ran into this because I always convert tabs to 4 spaces; once I understood that, I stopped doing it for my CSV files.
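If you generate the files programmatically, the easiest way to guarantee real tabs is to let a CSV writer emit them. A minimal sketch (the column names follow the example above; the row contents are illustrative only):

import csv

# Emit nodes.csv and rels.csv with real tab delimiters so the
# importer's tab-based tokenizer can split every column.
nodes = [
    ["name", "age", "works_on"],
    ["Michael", "37", "neo4j"],
]
rels = [
    ["start", "end", "type", "since", "counter:int"],
    ["1", "2", "FATHER_OF", "1998-07-10", "1"],
]

with open("nodes.csv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(nodes)

with open("rels.csv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rels)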

Related

Webscraping RequestGet from Airbnb not working properly

This query is randomly returning 0 or 20 every time I run it. Yesterday, when I looped through the pages, I always got 20 and was able to scrape 20 listings across 15 pages. But now I can't run my code properly because the listings sometimes come back as 0.
I tried adding headers to the GET request and a time.sleep (5-10 s, random) before each request, but I am still facing the same issue. I also tried connecting to a hotspot to change my IP, but the issue persists. Does anyone understand why?
import time
from random import randint
from bs4 import BeautifulSoup
import requests #to connect to url
airbnb_url = 'https://www.airbnb.com/s/Mayrhofen--Austria/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&date_picker_type=calendar&query=Mayrhofen%2C%20Austria&place_id=ChIJbzLYLzjdd0cRDtGuTzM_vt4&checkin=2021-02-06&checkout=2021-02-13&adults=4&source=structured_search_input_header&search_type=autocomplete_click'
soup = BeautifulSoup(requests.get(airbnb_url).content, 'html.parser')
listings = soup.find_all('div', '_8s3ctt')
print(len(listings))
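For reference, the headers/sleep attempt described above would look roughly like the sketch below (the User-Agent value and the 5-10 s range are assumptions, not the asker's exact settings):

import time
from random import randint

import requests
from bs4 import BeautifulSoup

airbnb_url = 'https://www.airbnb.com/s/Mayrhofen--Austria/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&date_picker_type=calendar&query=Mayrhofen%2C%20Austria&place_id=ChIJbzLYLzjdd0cRDtGuTzM_vt4&checkin=2021-02-06&checkout=2021-02-13&adults=4&source=structured_search_input_header&search_type=autocomplete_click'
headers = {'User-Agent': 'Mozilla/5.0'}  # assumed browser-like header

time.sleep(randint(5, 10))  # random 5-10 s pause before the request
soup = BeautifulSoup(requests.get(airbnb_url, headers=headers).content, 'html.parser')
print(len(soup.find_all('div', '_8s3ctt')))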
It seems Airbnb returns two versions of the page: one "normal" HTML version, and another where the listings are stored inside a <script> tag. To parse the <script> version of the page you can use the following example:
import json
import requests
from bs4 import BeautifulSoup


def find_listing(d):
    if isinstance(d, dict):
        if "__typename" in d and d["__typename"] == "DoraListingItem":
            yield d["listing"]
        else:
            for v in d.values():
                yield from find_listing(v)
    elif isinstance(d, list):
        for v in d:
            yield from find_listing(v)


airbnb_url = "https://www.airbnb.com/s/Mayrhofen--Austria/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&date_picker_type=calendar&query=Mayrhofen%2C%20Austria&place_id=ChIJbzLYLzjdd0cRDtGuTzM_vt4&checkin=2021-02-06&checkout=2021-02-13&adults=4&source=structured_search_input_header&search_type=autocomplete_click"
soup = BeautifulSoup(requests.get(airbnb_url).content, "html.parser")
listings = soup.find_all("div", "_8s3ctt")

if len(listings):
    # normal page:
    print(len(listings))
else:
    # page that has listings stored inside <script>:
    data = json.loads(soup.select_one("#data-deferred-state").contents[0])
    for i, l in enumerate(find_listing(data), 1):
        print(i, l["name"])
Prints (when the <script> version was returned):
1 Mariandl (MHO103) for 36 persons.
2 central and friendly! For Families and Friends
3 Sonnenheim for 5 persons.
4 MO's Apartments
5 MO's Apartments
6 Beautiful home in Mayrhofen with 3 Bedrooms
7 Quaint Apartment in Finkenberg near Ski Lift
8 Apartment 2 Villa Daringer (5 pax.)
9 Modern Apartment in Schwendau with Garden
10 Holiday flats Dornau, Mayrhofen
11 Maple View
12 Laubichl Lodge by Apart Hotel Therese
13 Haus Julia - Apartment Edelweiß Mayrhofen
14 Melcherhof,
15 Rest coke
16 Vacation home Traudl
17 Luxurious Apartment near Four Ski Lifts in Mayrhofen
18 Apartment 2 60m² for 2-4 persons "Binder"
19 Apart ZEMMGRUND, 4-9 persons in Mayrhofen/Tirol
20 Apartment Ahorn View
EDIT: To print lat, lng:
...
for i, l in enumerate(find_listing(data), 1):
    print(i, l["name"], l["lat"], l["lng"])
Prints:
1 Mariandl (MHO103) for 36 persons. 47.16522 11.85723
2 central and friendly! For Families and Friends 47.16209 11.859691
3 Sonnenheim for 5 persons. 47.16809 11.86694
4 MO's Apartments 47.166969 11.863186
...
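If you want to see what other fields each listing carries, one quick way (reusing data and find_listing from the example above) is to pretty-print a single item:

# Inspect one listing's raw JSON to discover the available keys
print(json.dumps(next(find_listing(data)), indent=2))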

Print required lines from 3 files in Unix

There are 3 files in a directory. How can I print the first file's 1st line, the second file's 3rd line, and the third file's 4th line using a UNIX command?
I tried cat filename.txt | sed -n 1p, but that applies to only one file. How can I handle all three files at a time?
Using awk: at the beginning of each file, f is incremented to track which file we're dealing with; then we just pair that with the required record number within each file (FNR):
$ awk 'FNR==1 {f++} f==1&&FNR==1 || f==2&&FNR==3 || f==3&&FNR==4' 1 2 3
11
23
34
Contents of the first file; the others are similar:
$ cat 1
11
12
13
14
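The same selection can be done without awk; here is a minimal Python sketch under the same assumption that the files are named 1, 2 and 3:

# Print line 1 of file "1", line 3 of file "2" and line 4 of file "3"
wanted = {"1": 1, "2": 3, "3": 4}

for fname, lineno in wanted.items():
    with open(fname) as f:
        for i, line in enumerate(f, 1):
            if i == lineno:
                print(line, end="")
                break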

bytes object has no attribute find_all

I've been trying for the last 3 hours to scrape this website and get the rank, name, wins, and losses of each team.
When implementing this code:
import requests
from bs4 import BeautifulSoup
halo = requests.get("https://www.halowaypoint.com/en-us/esports/standings")
page = BeautifulSoup(halo.content, "html.parser")
final = page.encode('utf-8')
print(final.find_all("div"))
I keep getting this error
If anyone can help me out then it would be much appreciated!
Thanks!
You are calling the method on the wrong variable; use the BeautifulSoup object page, not the byte string final:
print(page.find_all("div"))
Getting the table data is also pretty straightforward: all of it is inside the div matching the CSS selector "div.table.table--hcs":
import requests
from bs4 import BeautifulSoup

halo = requests.get("https://www.halowaypoint.com/en-us/esports/standings")
page = BeautifulSoup(halo.content, "html.parser")
table = page.select_one("div.table.table--hcs")

print(",".join([td.text for td in table.select("header div.td")]))

for row in table.select("div.tr"):
    rank, team = row.select_one("span.numeric--medium.hcs-trend-neutral").text, row.select_one("div.td.hcs-title").span.a.text
    wins, losses = [div.span.text for div in row.select("div.td.em-7")]
    print(rank, team, wins, losses)
If we run the code, you can see the data matches the table:
In [4]: print(",".join([td.text for td in table.select("header div.td")]))
Rank,Team,Wins,Losses

In [5]: for row in table.select("div.tr"):
   ...:     rank, team = row.select_one("span.numeric--medium.hcs-trend-neutral").text, row.select_one("div.td.hcs-title").span.a.text
   ...:     wins, losses = [div.span.text for div in row.select("div.td.em-7")]
   ...:     print(rank, team, wins, losses)
   ...:
1 Counter Logic Gaming 10 1
2 Team EnVyUs 8 3
3 Enigma6 8 3
4 Renegades 6 5
5 Team Allegiance 5 6
6 Evil Geniuses 4 7
7 OpTic Gaming 2 9
8 Team Liquid 1 10
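If you would rather save the standings than print them, a small extension of the same loop (the standings.csv filename is just an example) can write them out with the csv module:

import csv

# Reuses the `table` object from the snippet above and writes
# the header plus one row per team to standings.csv.
with open("standings.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([td.text for td in table.select("header div.td")])
    for row in table.select("div.tr"):
        rank = row.select_one("span.numeric--medium.hcs-trend-neutral").text
        team = row.select_one("div.td.hcs-title").span.a.text
        wins, losses = [div.span.text for div in row.select("div.td.em-7")]
        writer.writerow([rank, team, wins, losses])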

Removing duplicate records from .Xdf file

I would like to remove the duplicate records from my large .xdf file trans.xdf.
Here are the file details:
File name: /poc/revor/data/trans.xdf
Number of observations: 1000000000
Number of variables: 5
Number of blocks: 40
Compression type: zlib
Variable information:
Var 1: CARD_ID, Type: character
Var 2: SE_NO, Type: character
Var 3: r12m_cv, Type: numeric, Low/High: (-2348.7600, 40587.3900)
Var 4: r12m_roc, Type: numeric, Low/High: (0.0000, 231.0000)
Var 5: PROD_GRP_CD, Type: character
Also below is the sample data of the file:
CARD_ID SE_NO r12m_cv r12m_roc PROD_GRP_CD
900000999000000000 1045815024 110 1 1
900000999000000000 1052487253 247.52 2 1
900000999000000000 9999999999 38.72 1 1
900000999000000000 1090389768 1679.96 16 1
900000999000000000 1091226035 0 1 1
900000999000000000 1091241208 538.68 4 1
900000999000000000 9999999999 83 1 1
900000999000000000 1091468041 148.4 3 1
900000999000000000 1092640358 3.13 1 1
900000999000000000 1093468692 546.29 1 1
I have tried using the rxDataStep function's transform parameter to call the unique() function over the .xdf file. Below is the code:
uniq_dat <- function( dataList )
{
  datalist <- unique(datalist)
  return(datalist)
}
rxDataStepXdf(inFile = "/poc/revor/data/trans.xdf",outFile = "/poc/revor/data/trans.xdf",transformFunc = uniq_dat,overwrite = TRUE)
But I was getting the error below:
Error in unique(datalist) : object 'datalist' not found
Error in transformation function: Error in unique(datalist) : object 'datalist' not found
Error in rxCall("RxDataStep", params) :
So could anybody point out the mistake that I am making here, or suggest a better way to remove the duplicate records from the .xdf file? I am avoiding loading the data into an in-memory data frame as the data is pretty huge.
I am running the above code in the Revolution R environment over HDFS.
If the same can be achieved by another approach, an example would be appreciated.
Thanks for the help in advance :)
Cheers,
Amit
You can remove the duplicate values by providing the removeDupKeys=TRUE parameter to the rxSort() function. For example, in your case:
XdfFilePath <- file.path("<your file's fully qualified path>/trans.xdf")
rxSort(inData = XdfFilePath,sortByVars=c("CARD_ID","SE_NO","r12m_cv","r12m_roc","PROD_GRP_CD"), removeDupKeys=TRUE)
If you want to remove duplicate records based on a specific key column, for example the SE_NO column,
set the key value as sortByVars="SE_NO".

fpc Pascal Runtime error 216 before execution ends

I was implementing an adjacency list in Pascal (by first reading the edge end points, and then using dynamic arrays to assign the required amount of memory to each node's edge list). The program executes fine and gives correct output, but reports runtime error 216 just before exiting.
The code is :
type aptr = array of longint;
var edgebuf: array[1..200000,1..2] of longint;
    ptrs: array[1..100000] of longint;
    i,j,n,m: longint;
    elist: array[1..100000] of aptr;
{main}
begin
  readln(n,m);
  fillchar(ptrs,sizeof(ptrs),#0);
  for i:=1 to m do begin
    readln(edgebuf[i][1],edgebuf[i][2]);
    inc(ptrs[edgebuf[i][1]]);
  end;
  for i:=1 to n do begin
    setlength(elist[i],ptrs[i]);
  end;
  fillchar(ptrs,sizeof(ptrs),#0);
  for i:=1 to m do begin
    inc(ptrs[edgebuf[i][1]]);
    elist[edgebuf[i][1]][ptrs[edgebuf[i][1]]]:=edgebuf[i][2];
  end;
  for i:=1 to n do begin
    writeln(i,' begins');
    for j:=1 to ptrs[i] do begin
      write(j,' ',elist[i][j],' ');
    end;
    writeln();
    writeln(i,' ends');
  end;
  writeln('bye');
end.
When run on file
4 5
1 2
3 2
4 3
2 1
2 3
gives output:
1 begins
1 2
1 ends
2 begins
1 1 2 3
2 ends
3 begins
1 2
3 ends
4 begins
1 3
4 ends
bye
Runtime error 216 at $0000000000416644
$0000000000416644
$00000000004138FB
$0000000000413740
$0000000000400645
$00000000004145D2
$0000000000400180
Once the program says "bye", what is the program executing that is giving runtime error 216?
RTE 216 is in general a fatal exception: GPF/SIGSEGV and, in some cases, SIGILL/SIGBUS. It probably means that your program corrupts memory somewhere.
Compiling with runtime checks on might help you find the error (Free Pascal: -Criot).
