Reading data from URL - julia

Is there a reasonably easy way to get data from some URL? I tried the most obvious version, which does not work:
readcsv("https://dl.dropboxusercontent.com/u/.../testdata.csv")
I did not find any usable reference. Any help?

If you want to read a CSV from a URL, you can use the Requests package as @waTeim shows and then read the data through an IOBuffer. See the example below.
Or, as @Colin T Bowers comments, you could use the currently (December 2017) more actively maintained HTTP.jl package like this:
julia> using HTTP
julia> res = HTTP.get("https://www.ferc.gov/docs-filing/eqr/q2-2013/soft-tools/sample-csv/transaction.txt");
julia> mycsv = readcsv(res.body);
julia> for (colnum, myheader) in enumerate(mycsv[1,:])
           println(colnum, '\t', myheader)
       end
1 transaction_unique_identifier
2 seller_company_name
3 customer_company_name
4 customer_duns_number
5 tariff_reference
6 contract_service_agreement
7 trans_id
8 transaction_begin_date
9 transaction_end_date
10 time_zone
11 point_of_delivery_control_area
12 specific location
13 class_name
14 term_name
15 increment_name
16 increment_peaking_name
17 product_name
18 transaction_quantity
19 price
20 units
21 total_transmission_charge
22 transaction_charge
Using the Requests.jl package:
julia> using Requests
julia> res = get("https://www.ferc.gov/docs-filing/eqr/q2-2013/soft-tools/sample-csv/transaction.txt");
julia> mycsv = readcsv(IOBuffer(res.data));
julia> for (colnum, myheader) in enumerate(mycsv[1,:])
           println(colnum, '\t', myheader)
       end
(The output is identical to the HTTP.jl example above.)
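Note that readcsv was removed in Julia 1.0. On current Julia versions, a minimal sketch of the same idea, assuming the CSV.jl and DataFrames.jl packages are installed, would be:
using HTTP, CSV, DataFrames

res = HTTP.get("https://www.ferc.gov/docs-filing/eqr/q2-2013/soft-tools/sample-csv/transaction.txt")
# res.body is a Vector{UInt8}; wrap it in an IOBuffer so CSV.jl can parse it in memory
df = CSV.read(IOBuffer(res.body), DataFrame)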

If you are looking to read into a DataFrame, this will also work in Julia (on recent CSV.jl versions the sink argument is required):
using CSV, DataFrames
dataset = CSV.read(download("https://mywebsite.edu/ml/machine-learning-databases/my.data"), DataFrame)

The Requests package seems to work pretty well. There are others (see the entire package list) but Requests is actively maintained.
Obtaining it
julia> Pkg.add("Requests")
julia> using Requests
Using it
You can use one of the exported functions that correspond to the various HTTP verbs (get, post, etc.), which return a Response type:
julia> res = get("http://julialang.org")
Response(200 OK, 21 Headers, 20913 Bytes in Body)
julia> typeof(res)
Response (constructor with 8 methods)
And then, for example, you can print the data using @printf:
julia> @printf("%s",res.data);
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-us" lang="en-us">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
...

If the URL points directly to a CSV file, something like this should work:
A = readdlm(download(url),';')
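On Julia 0.7 and later, readdlm lives in the DelimitedFiles standard library, so a version-proof variant of the same one-liner (url standing in for your file's address) would be:
using DelimitedFiles  # readdlm moved here in Julia 0.7

# download() fetches the URL into a temporary local file and returns its path
A = readdlm(download(url), ';')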

Nowadays you can also use UrlDownload.jl, which is pure Julia, takes care of the download details, processes data in memory, and can also work with compressed files.
Usage is straightforward:
using UrlDownload
A = urldownload("https://data.ok.gov/sites/default/files/unspsc%20codes_3.csv")
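If the goal is a DataFrame, the parsed result can be piped straight into one; a sketch assuming DataFrames.jl is also installed:
using UrlDownload, DataFrames

url = "https://data.ok.gov/sites/default/files/unspsc%20codes_3.csv"
df = urldownload(url) |> DataFrame  # urldownload already parsed the CSV in memory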

Related

How would I go about web scraping from an interactive map?

This pertains to this interactive map: https://www.newworld-map.com/?filters=ores
Take the ores as an example: how would I go about getting the coordinates of each node? It looks like the HTML element is a Canvas, and I could not for the life of me figure out where it pulls the data from.
Any help would be greatly appreciated.
Hoping that the next OP's question will be more in line with Stack Overflow's guidelines (see https://stackoverflow.com/help/minimal-reproducible-example), one way to solve this is to inspect what network calls are made when the page loads, and scrape the API endpoint the data is pulled from, like below:
import requests
import pandas as pd
import time

# timestamp (microseconds) for the 'time' query parameter
time_stamp = int(time.time_ns() / 1000)
ore_list = []
url = f'https://www.newworld-map.com/markers.json?time={time_stamp}'
ores = requests.get(url).json()['ores']
for ore in ores:
    for x in ores[ore]:
        ore_list.append((ore, x, ores[ore][x]['x'], ores[ore][x]['y']))
df = pd.DataFrame(ore_list, columns=['Ore', 'Code', 'X_Coord', 'Y_Coord'])
print(df)
Result in terminal:
Ore Code X_Coord Y_Coord
0 brimstone 02d1ba070438d53ce5fbb1955cd7d694 7473.096191 8715.674805
1 brimstone 0a50c499af034aeb6f38e011648a2ea8 7471.124512 8709.161133
2 brimstone 0b5b190c31eb3d314d993dd393aadfe8 5670.894043 7862.319336
3 brimstone 0f5c7427c75d80e10f71f9e92ddc4362 5883.601562 7703.445801
4 brimstone 20b0801bdb41c7dafbb1053b43c25bd8 6020.838379 8147.747070
... ... ... ... ...
4260 starmetal 86h 8766.964000 8431.438000
4261 starmetal 86i 8598.688000 8562.974000
4262 starmetal 86j 8586.000000 8211.000000
4263 starmetal 86k 8688.938000 8509.722000
4264 starmetal 86l 8685.827000 8505.694000
4265 rows × 4 columns

Line profiling with cython in jupyter notebook

I'm trying to use the line_profiler library in a Jupyter notebook with a Cython function. It only works halfway: the result I get consists of just the first row of the function and no profiling results.
%%cython -a
# cython: linetrace=True
# cython: binding=True
# distutils: define_macros=CYTHON_TRACE_NOGIL=1
import numpy as np
cimport numpy as np
from datetime import datetime
import math
cpdef np.int64_t get_days(np.int64_t year, np.int64_t month):
    cdef np.ndarray months = np.array([31,28,31,30,31,30,31,31,30,31,30,31])
    if month == 2:
        if (year%4==0 and year%100!=0) or (year%400==0):
            return 29
    return months[month-1]
The profiling result only shows one line of code:
Timer unit: 1e-07 s
Total time: 0.0015096 s
File: .ipython\cython\_cython_magic_0154a9feed9bbd6e4f23e57d73acf50f.pyx
Function: get_days at line 15
Line # Hits Time Per Hit % Time Line Contents
==============================================================
15 cpdef np.int64_t get_days(np.int64_t year, np.int64_t month):
This can be seen as a bug in line_profiler (if it is supposed to support Cython at all). To get the code of the profiled function, line_profiler reads the pyx-file and tries to extract the code with the help of inspect.getblock:
...
# read pyx-file
all_lines = linecache.getlines(filename)
# try to extract body of the function starting at start_lineno:
sublines = inspect.getblock(all_lines[start_lineno-1:])
...
However, getblock knows nothing about cpdef-functions, as Python has only def-functions, and thus yields the wrong function body (i.e. only the signature).
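This is easy to reproduce with plain inspect; a minimal sketch (the two-line pyx body is made up for illustration):
import inspect

lines = [
    "cpdef np.int64_t get_days(np.int64_t year, np.int64_t month):\n",
    "    cdef np.ndarray months = np.array([31,28,31])\n",
    "    return months[month-1]\n",
]
# inspect's BlockFinder only starts a block at 'def', 'class' or 'lambda',
# so for a 'cpdef' signature it never starts one and falls back to line 1.
print(inspect.getblock(lines))  # only the signature line is returned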
Workaround:
A simple workaround is to introduce a dummy def-function as a sentinel right after the cpdef-function, so that inspect.getblock yields the whole body of the cpdef-function plus the body of the sentinel function, i.e.:
%%cython
...
cpdef np.int64_t get_days(np.int64_t year, np.int64_t month):
    ...
def get_days_sentinel():
    pass
and now the report %lprun -f get_days get_days(2019,3) looks as follows:
Timer unit: 1e-06 s
Total time: 1.7e-05 s
File: XXXX.pyx
Function: get_days at line 10
Line # Hits Time Per Hit % Time Line Contents
==============================================================
10 cpdef np.int64_t get_days(np.int64_t year, np.int64_t month):
11 1 14.0 14.0 82.4 cdef np.ndarray months=np.array([31,28,31,30,31,30,31,31,30,31,30,31])
12 1 1.0 1.0 5.9 if month==2:
13 if (year%4==0 and year%100!=0) or (year%400==0):
14 return 29
15 1 2.0 2.0 11.8 return months[month-1]
16
17 def get_days_sentinel():
18 pass
There are still some ugly trailing lines from the sentinel in the report, but that is probably better than not seeing anything at all.

is it possible to get a new instance for namedtuple pushed into a dictionary before values are known?

It looks like things are going wrong on line 9 for me. There I wish to push a new copy of the TagsTable into a dictionary. I'm aware that once a namedtuple field is recorded, it cannot be changed. However, the results baffle me, as it looks like the values do change: when this code exits, all entries of mp3_tags[<any of the three dictionary keys>].date are set to the last date, "1999_03_21".
So, two questions:
Is there a way to get a new TagsTable pushed into the dictionary ?
Why doesn't the code fail and refuse to write the second (and even third) date to the TagsTable.date field (since these all seem to be references to the same namedtuple)? I thought you could not write a second value?
 1 from collections import namedtuple
 2 TagsTable = namedtuple('TagsTable',['title','date','subtitle','artist','summary','length','duration','pub_date'])
 3 mp3files = ['42-001.mp3','42-002.mp3','42-003.mp3']
 4 dates = ['1999_01_07', '1999_02_14', '1999_03_21']
 5
 6 mp3_tags = {}
 7
 8 for mp3file in mp3files:
 9     mp3_tags[mp3file] = TagsTable
10
11 for mp3file,date_string in zip(mp3files,dates):
12     mp3_tags[mp3file].date = date_string
13
14 for mp3file in mp3files:
15     print( mp3_tags[mp3file].date )
looks like this is the fix I was looking for:
from collections import namedtuple

mp3files = ['42-001.mp3','42-002.mp3','42-003.mp3']
dates = ['1999_01_07', '1999_02_14', '1999_03_21']

mp3_tags = {}

for mp3file in mp3files:
    mp3_tags[mp3file] = namedtuple('TagsTable',['title','date','subtitle','artist','summary','length','duration','pub_date'])

for mp3file,date_string in zip(mp3files,dates):
    mp3_tags[mp3file].date = date_string

for mp3file in mp3files:
    print( mp3_tags[mp3file].date )
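To answer the second question: line 9 stores the TagsTable class itself, not an instance, under every key, so all three keys share one object, and assigning .date on a class merely rebinds a class attribute, which Python allows; namedtuple immutability only protects the fields of instances. The fix above works because each key now gets its own freshly created class, but it still mutates classes. A more idiomatic sketch (the None placeholders are an assumption) creates real instances and updates them with _replace:
from collections import namedtuple

TagsTable = namedtuple('TagsTable',['title','date','subtitle','artist','summary','length','duration','pub_date'])

mp3files = ['42-001.mp3','42-002.mp3','42-003.mp3']
dates = ['1999_01_07', '1999_02_14', '1999_03_21']

# one fresh instance per file, with None placeholders for the unknown fields
mp3_tags = {f: TagsTable(*[None]*len(TagsTable._fields)) for f in mp3files}

# instances are immutable, so _replace returns a new instance
for mp3file, date_string in zip(mp3files, dates):
    mp3_tags[mp3file] = mp3_tags[mp3file]._replace(date=date_string)

for mp3file in mp3files:
    print(mp3_tags[mp3file].date)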

removing some part of string from url

I want to remove the jsessionid from the given URL string, as well as the slash at the start:
/product.screen?productId=BS-AG-G09&JSESSIONID=SD1SL6FF6ADFF6510
so that output would be like
product.screen?productId=BS-AG-G09
More data like:
1 /product.screen?productId=WC-SH-A02&JSESSIONID=SD0SL6FF7ADFF4953
2 /oldlink?itemId=EST-6&JSESSIONID=SD0SL6FF7ADFF4953
3 /product.screen?productId=BS-AG-G09&JSESSIONID=SD0SL6FF7ADFF4953
4 /product.screen?productId=FS-SG-G03&JSESSIONID=SD0SL6FF7ADFF4953
5 /cart.do?action=remove&itemId=EST-11&productId=WC-SH-A01&JSESSIONID=SD0SL6FF7ADFF4953
6 /oldlink?itemId=EST-14&JSESSIONID=SD0SL6FF7ADFF4953
7 /cart.do?action=view&itemId=EST-6&productId=MB-AG-T01&JSESSIONID=SD1SL6FF6ADFF6510
8 /product.screen?productId=BS-AG-G09&JSESSIONID=SD1SL6FF6ADFF6510
9 /product.screen?productId=WC-SH-A02&JSESSIONID=SD1SL6FF6ADFF6510
10 /cart.do?action=view&itemId=EST-6&productId=WC-SH-A02&JSESSIONID=SD1SL6FF6ADFF6510
11 /product.screen?productId=WC-SH-A02&JSESSIONID=SD1SL6FF6ADFF6510
You may use:
library(stringi)
lf1 = "/product.screen?productId=BS-AG-G09&JSESSIONID=SD0SL6FF7ADFF4953"
stri_replace_all_regex(lf1, "&JSESSIONID=.*", "")
Here the pattern &JSESSIONID=.* (everything from &JSESSIONID= up to the end of the string) gets replaced with nothing ("").
Or simply: gsub("&JSESSIONID=.*", "", lf1)
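Both replacements are vectorised, so a whole column of URLs can be cleaned in one call. A sketch (the urls vector is illustrative) that also strips the leading slash asked about:
urls <- c(
  "/product.screen?productId=WC-SH-A02&JSESSIONID=SD0SL6FF7ADFF4953",
  "/oldlink?itemId=EST-6&JSESSIONID=SD0SL6FF7ADFF4953"
)
# drop the session id, then the leading "/" (sub replaces only the first match)
sub("^/", "", gsub("&JSESSIONID=.*", "", urls))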

Exception importing data into neo4j using batch-import

I am running neo4j 1.8.2 on a remote Unix box. I am using this jar (https://github.com/jexp/batch-import/downloads).
nodes.csv is the same as given in the example:
name age works_on
Michael 37 neo4j
Selina 14
Rana 6
Selma 4
rels.csv is like this:
start end type since counter:int
1 2 FATHER_OF 1998-07-10 1
1 3 FATHER_OF 2007-09-15 2
1 4 FATHER_OF 2008-05-03 3
3 4 SISTER_OF 2008-05-03 5
2 3 SISTER_OF 2007-09-15 7
But I am getting this exception:
Using Existing Configuration File
Total import time: 0 seconds
Exception in thread "main" java.util.NoSuchElementException
at java.util.StringTokenizer.nextToken(StringTokenizer.java:332)
at org.neo4j.batchimport.Importer$Data.split(Importer.java:156)
at org.neo4j.batchimport.Importer$Data.update(Importer.java:167)
at org.neo4j.batchimport.Importer.importNodes(Importer.java:226)
at org.neo4j.batchimport.Importer.main(Importer.java:83)
I am new to neo4j and was checking whether this importer can save some coding effort.
It would be great if someone could point out the probable mistake.
Thanks for the help!
--Edit:--
My nodes.csv
name dob city state s_id balance desc mgr_primary mgr_secondary mgr_tertiary mgr_name mgr_status
John Von 8/11/1928 Denver CO 1114-010 7.5 RA 0023-0990 0100-0110 Doozman Keith Active
my rels.csv
start end type since status f_type f_num
2 1 address_of
1 3 has_account 5 Active
4 3 f_of Primary 0111-0230
Hi, I had some issues in the past with the batch import script.
The formatting of your files must be very rigorous, which means:
no extra spaces where none are expected, like the ones I see in the first line of your rels.csv before "start"
no multiple spaces in place of a tab. If your files are exactly like what you've copied here, you have 4 spaces instead of one tab, and that is not going to work, as the script uses a tokenizer looking for tabs! A sketch of a cleanup pass is below.
I had this issue because I always convert tabs to 4 spaces; once I understood that, I stopped doing it for my csv files!
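If an editor has already converted tabs to spaces, one hypothetical way to restore them before running the importer is a small Python pass (it assumes every run of two or more spaces was originally a single tab, and that the file names match yours):
import re

# rewrite a space-separated nodes.csv as the tab-separated file the importer expects
with open('nodes.csv') as src, open('nodes.tab.csv', 'w') as dst:
    for line in src:
        dst.write(re.sub(r' {2,}', '\t', line))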
