I have a CSV that I've loaded into Google Cloud Storage, and I am creating a Dataflow pipeline that will read and process the CSV, then perform a count of listings by a single column.
How do I isolate a single column? Let's say the columns are id, city, sports_team. I want to count how many occurrences of each city show up.
My starting code is like so:
# Python's regular expression library
import re
# Beam and interactive Beam imports
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib
class SplitRecords(beam.DoFn):
    """Split the element into records; return the city record."""
    def process(self, element):
        records = element.split(",")
        return [records[1]]
p = beam.Pipeline(InteractiveRunner())
lines = p | 'read in file' >> beam.io.ReadFromText("gs://ny-springml-data/AB_NYC_2019.csv", skip_header_lines=1)
records = lines | beam.ParDo(SplitRecords())
groups = (records | beam.Map(lambda x: (x, 1)) | beam.CombinePerKey(sum))
groups | beam.io.WriteToText('TEST2.txt')
I am getting an IndexError: list index out of range. I'm an extreme newbie at all of this, so any help is appreciated.
Presumably there's some unexpected line in your CSV file, e.g. a blank one. You could do something like
if len(records) < 2:
    raise ValueError("Bad line: %r" % element)
else:
    yield records[1]
to get a better error message. I would also recommend looking into using Beam Dataframes for this kind of task.
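For reference, here is a rough sketch of the whole pipeline with that guard folded into the DoFn (untested against your bucket; here malformed lines are simply skipped rather than raising). Note that a plain element.split(",") will also mis-split any field that contains a quoted comma, which is one more reason to consider Beam DataFrames or Python's csv module.
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

class SplitRecords(beam.DoFn):
    """Emit the city field (index 1) of each CSV line, skipping malformed lines."""
    def process(self, element):
        records = element.split(",")
        if len(records) < 2:
            return  # blank or malformed line: emit nothing instead of crashing
        yield records[1]

p = beam.Pipeline(InteractiveRunner())
counts = (
    p
    | 'read in file' >> beam.io.ReadFromText(
        "gs://ny-springml-data/AB_NYC_2019.csv", skip_header_lines=1)
    | 'extract city' >> beam.ParDo(SplitRecords())
    | 'pair with 1' >> beam.Map(lambda city: (city, 1))
    | 'count per city' >> beam.CombinePerKey(sum)
)
counts | beam.io.WriteToText('TEST2.txt')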
Suppose I have a table and I run a query like the one below:
let data = orders | where zip == "11413" | project timestamp, name, amount ;
inject data into newOrdersInfoTable in another cluster // ==> How can I achieve this?
There are many ways to do it. If it is a manual task and the amount of data is not too large, you can simply do something like this in the target cluster:
.set-or-append TargetTable <| cluster("put here the source cluster url").database("put here the source database").orders
| where zip == "11413" | project timestamp, name, amount
Note that if the dataset is larger, you can use the "async" flavor of this command. If the data size is bigger still, you should consider exporting the data and importing it into the other cluster.
I want to test my wordcount software, based on the MapReduce framework, with a very large file (over 1 GB), but I don't know how to generate it.
Are there any tools to create a large file with random but sensible english sentences?
Thanks
A simple Python script can create a pseudo-random document of words. Here is the one I wrote for just such a task a year ago:
import random

file1 = open("test.txt", "a")
PsudoRandomWords = ["Apple ", "Banana ", "Tree ", "Pickle ", "Toothpick ", "Coffee ", "Done "]
index = 0
# Increase the range to make a bigger file
for x in range(150000000):
    # Change the upper bound of randint below if you add more words
    index = random.randint(0, 6)
    file1.write(PsudoRandomWords[index])
    if x % 20 == 0:
        file1.write('\n')
file1.close()  # make sure everything is flushed to disk
Just add more words to the list to make it more varied, and increase the upper bound of the randint call to match. I just tested it, and it creates a document named test.txt of roughly one gigabyte. The file contains words from the list in random order, with a newline every 20 words.
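If you would rather not keep the word list and the randint bounds in sync by hand, a small variation on the same idea (a sketch, not the original script above) uses random.choice and writes a whole line at a time:
import random

words = ["Apple", "Banana", "Tree", "Pickle", "Toothpick", "Coffee", "Done"]
with open("test.txt", "w") as out:
    # Roughly 7.5 million lines of 20 words each lands near 1 GB; adjust to taste.
    for _ in range(7500000):
        # random.choice draws from the whole list, however long it grows.
        out.write(" ".join(random.choice(words) for _ in range(20)) + "\n")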
I wrote this simple Python script that scrapes the Project Gutenberg site and writes the text (encoding: us-ascii; if you want to use others, see http://www.gutenberg.org/files/) to a local text file. This script can be used in combination with https://github.com/c-w/gutenberg to do more accurate filtering (by language, by author, etc.)
from __future__ import print_function
import requests
import sys

if (len(sys.argv) != 2):
    print("[---------- ERROR ----------] Usage: scraper <number_of_files>", file=sys.stderr)
    sys.exit(1)
number_of_files = int(sys.argv[1])
text_file = open("big_file.txt", 'w+')

for i in range(number_of_files):
    url = 'http://www.gutenberg.org/files/' + str(i) + '/' + str(i) + '.txt'
    resp = requests.get(url)
    if resp.status_code != 200:
        print("[X] resp.status_code =", resp.status_code, "for", url)
        continue
    print("[V] resp.status_code = 200 for", url)
    try:
        content = resp.text
        # dummy cleaning of the text
        splitted_content = content.split("*** START OF THIS PROJECT GUTENBERG EBOOK")
        splitted_content = splitted_content[1].split("*** END OF THIS PROJECT GUTENBERG EBOOK")
        print(splitted_content[0], file=text_file)
    except:
        continue
text_file.close()
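For the more accurate filtering mentioned above, a rough sketch using the c-w/gutenberg package could replace the raw HTTP calls; the load_etext/strip_headers API is taken from that project's README, so treat the details as something to verify against your installed version:
# Sketch only: assumes the c-w/gutenberg package is installed.
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers

with open("big_file.txt", "w") as text_file:
    for etext_id in range(1, 200):  # fetch however many books you need
        try:
            # load_etext fetches the ebook text; strip_headers removes the
            # Project Gutenberg boilerplate at the start and end.
            text = strip_headers(load_etext(etext_id)).strip()
        except Exception:
            continue  # some ids are missing or are not plain text
        text_file.write(text + "\n")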
I have a complex JSON file (~8GB) containing publicly available data for businesses. We have decided to split the file up into multiple CSV files (or tabs in a .xlsx) so clients can easily consume the data. These files will be linked by the NZBN column/key.
I'm using R and jsonlite to read a small sample in (before scaling up to the full file). I'm guessing I need some way to specify which keys/columns go in each file (i.e., the first file will have headers: australianBusinessNumber, australianCompanyNumber, australianServiceAddress; the second file will have headers: annualReturnFilingMonth, annualReturnLastFiled, countryOfOrigin...)
Here's a sample of two businesses/entities (I've bunged some of the data as well so ignore the actual values): test file
I've read almost every similar question on S/O and none seem to be giving me any luck. I've tried variations of purrr, *apply commands, custom flattening functions, and jqr (an R version of 'jq' - it looks promising but I can't seem to get it running).
Here's an attempt at creating my separate files, but I'm unsure how to include the linking identifier (NZBN), and I keep running into further nested lists (I'm unsure how many levels of nesting there are):
bulk <- jsonlite::fromJSON("bd_test.json")
coreEntity <- data.frame(bulk$companies)
coreEntity <- coreEntity[,sapply(coreEntity, is.list)==FALSE]
company <- bulk$companies$entity$company
company <- purrr::reduce(company, dplyr::bind_rows)
shareholding <- company$shareholding
shareholding <- purrr::reduce(shareholding, dplyr::bind_rows)
shareAllocation <- shareholding$shareAllocation
shareAllocation <- purrr::reduce(shareAllocation, dplyr::bind_rows)
I'm not sure if it's easier to split the files up during the flattening/wrangling process, or to completely flatten the whole file so I just have one line per business/entity (and then gather columns as needed) - my only concern is that I need to scale this up to ~1.3 million nodes (an 8GB JSON file).
Ideally I would want the csv files split every time there is a new collection, and the values in the collection would become the columns for the new csv/tab.
Any help or tips would be much appreciated.
------- UPDATE ------
Updated, as my question was a little vague. I think all I need is some code to produce one of the CSVs/tabs, and I can replicate it for the other collections.
Say for example, I wanted to create a csv of the following elements:
entityName (unique linking identifier)
nzbn (unique linking identifier)
emailAddress__uniqueIdentifier
emailAddress__emailAddress
emailAddress__emailPurpose
emailAddress__emailPurposeDescription
emailAddress__startDate
How would I go about that?
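Not an R answer, but for comparison, here is a minimal pandas sketch of that email-address table; the companies/entity/emailAddress nesting and the field names are assumptions based on the structure described above, and it loads the whole sample into memory:
import json
import pandas as pd

with open("bd_test.json") as f:
    bulk = json.load(f)

# One row per email address, carrying the two linking identifiers along.
email_df = pd.json_normalize(
    bulk["companies"]["entity"],
    record_path="emailAddress",
    meta=["entityName", "nzbn"],
    record_prefix="emailAddress__",
)
email_df.to_csv("emailAddress.csv", index=False)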
"I'm unsure how many levels of nesting there are"
This will provide an answer to that quite efficiently:
jq '
def max(s): reduce s as $s (null;
if . == null then $s elif $s > . then $s else . end);
max(paths|length)' input.json
(With the test file, the answer is 14.)
To get an overall view (schema) of the data, you could run:
jq 'include "schema"; schema' input.json
where schema.jq is available at this gist. This will produce a structural schema.
"Say for example, I wanted to create a csv of the following elements:"
Here's a jq solution, apart from the headers:
.companies.entity[]
| [.entityName, .nzbn]
+ (.emailAddress[] | [.uniqueIdentifier, .emailAddress, .emailPurpose, .emailPurposeDescription, .startDate])
| @csv
shareholding
The shareholding data is complex, so in the following I've used the to_table function defined elsewhere on this page.
The sample data does not include a "company name" field so in the following, I've added a 0-based "company index" field:
.companies.entity[]
| [.entityName, .nzbn] as $ix
| .company
| range(0;length) as $cix
| .[$cix]
| $ix + [$cix] + (.shareholding[] | to_table(false))
jqr
The above solutions use the standalone jq executable, but all going well, it should be trivial to use the same filters with jqr, though to use jq's include, it might be simplest to specify the path explicitly, as for example:
include "schema" {search: "~/.jq"};
If the input JSON is sufficiently regular, you might find the following flattening function helpful, especially as it can emit a header in the form of an array of strings based on the "paths" to the leaf elements of the input, which can be arbitrarily nested:
# to_table produces a flat array.
# If hdr == true, then ONLY emit a header line (in prettified form, i.e. as an array of strings);
# if hdr is an array, it should be the prettified form and is used to check consistency.
def to_table(hdr):
  def prettify: map( (map(tostring)|join(":") ));
  def composite: type == "object" or type == "array";
  def check:
    select(hdr|type == "array")
    | if prettify == hdr then empty
      else error("expected head is \(hdr) but imputed header is \(.)")
      end ;
  . as $in
  | [paths(composite|not)]   # the paths in array-of-array form
  | if hdr==true then prettify
    else check, map(. as $p | $in | getpath($p))
    end;
For example, to produce the desired table (without headers) for .emailAddress, one could write:
.companies.entity[]
| [.entityName, .nzbn] as $ix
| $ix + (.emailAddress[] | to_table(false))
| @tsv
(Adding the headers and checking for consistency are left as an exercise for now, but are dealt with below.)
Generating multiple files
More interestingly, you could select the level you want, and produce multiple tables automagically. One way to partition the output into separate files efficiently would be to use awk. For example, you could pipe the output obtained using this jq filter:
["entityName", "nzbn"] as $common
| .companies.entity[]
| [.entityName, .nzbn] as $ix
| (to_entries[] | select(.value | type == "array") | .key) as $key
| ($ix + [$key] | join("-")) as $filename
| (.[$key][0]|to_table(true)) as $header
# First emit the line giving all the headers:
| $filename, ($common + $header | #tsv),
# Then emit the rows of the table:
(.[$key][]
| ($filename, ($ix + to_table(false) | #tsv)))
to
awk -F\\t 'fn {print >> fn; fn=0;next} {fn=$1".tsv"}'
This will produce headers in each file; if you want consistency checking, change to_table(false) to to_table($header).
I am a medical doctor trying to model a drugs-to-enzymes database, and am starting with a CSV file I use to load my data into the Gephi graph layout program. I understand the power of a graph DB but am illiterate in Cypher.
The current CSV has the following format:
source;target;arc_type; <- this is a header needed for Gephi import
artemisinin;2B6;induces;
...
amiodarone;1A2;represses;
...
3A457;carbamazepine;metabolizes;
These sample records show the three types of relationships. Drugs can repress or augment a cytochrome, and cytochromes metabolize drugs.
Is there a way to use this CSV as is to load into neo4j and create the graph?
Thank you very much.
In neo4j terminology, a relationship must have a type, and a node can have any number of labels. It looks like your use case could benefit from labelling your nodes with either Drug or Cytochrome.
Here is a possible neo4j data model for your use case:
(:Drug)-[:MODULATES {induces: false}]->(:Cytochrome)
(:Cytochrome)-[:METABOLIZES]->(:Drug)
The induces property has a boolean value indicating whether a drug induces (true) or represses (false) the related cytochrome.
The following is a (somewhat complex) query that generates the above data model from your CSV file:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///Drugs.csv' AS line FIELDTERMINATOR ';'
WITH line,
  CASE line.arc_type
    WHEN 'metabolizes' THEN {a: [1]}
    WHEN 'induces' THEN {b: [true]}
    ELSE {b: [false]}
  END AS todo
FOREACH (ignored IN todo.a |
  MERGE (c:Cytochrome {id: line.source})
  MERGE (d:Drug {id: line.target})
  MERGE (c)-[:METABOLIZES]->(d)
)
FOREACH (induces IN todo.b |
  MERGE (d:Drug {id: line.source})
  MERGE (c:Cytochrome {id: line.target})
  MERGE (d)-[:MODULATES {induces: induces}]->(c)
)
The FOREACH clause does nothing if the value after the IN is null.
Yes, it's possible, but you will need to install APOC: a collection of useful stored procedures for Neo4j. You can find it here: https://neo4j-contrib.github.io/neo4j-apoc-procedures/
Then you should put your CSV file into the import folder of Neo4j and run these queries:
The first one creates a unique constraint on :Node(name):
CREATE CONSTRAINT ON (n:Node) ASSERT n.name IS UNIQUE;
And then this query to import your data:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///my-csv-file.csv' AS line
MERGE (n:Node {name:line.source})
MERGE (m:Node {name:line.target})
CALL apoc.create.relationship(n, line.arc_type, {}, m) YIELD rel
RETURN count(rel)
I have implemented https://dmorgan.info/posts/common-crawl-python/ as described in that link. However, I want to fetch the entire data rather than the partial data described in the post. So, in this code chunk,
def get_partial_warc_file(url, num_bytes=1024 * 10):
    with closing(requests.get(url, stream=True)) as r:
        buf = StringIO(r.raw.read(num_bytes))
        return warc.WARCFile(fileobj=buf, compress=True)
I have made the following change:
def get_partial_warc_file(url):
    with closing(requests.get(url, stream=True)) as r:
        buf = StringIO(r.raw.data)
        return warc.WARCFile(fileobj=buf, compress=True)
This change increases the number of records returned for a given WARC path, but it still does not read the entire set of records. I can't find a possible reason for this. Any help would be appreciated.
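In case it is useful, here is a minimal sketch that sidesteps the in-memory buffering altogether by downloading the whole .warc.gz to disk first and then letting the warc library read it; it assumes the same warc and requests packages as the linked post and is untested against your paths:
import requests
import warc

def get_full_warc_file(url, local_path="full.warc.gz"):
    # Stream the response to disk in chunks so the file never has to sit
    # in memory as a single string.
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    # warc.open handles the gzip container itself for .warc.gz paths.
    return warc.open(local_path)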