How can I create a large file with random but sensible English words? - bigdata

I want to test my wordcount software, based on the MapReduce framework, with a very large file (over 1 GB), but I don't know how to generate one.
Are there any tools to create a large file with random but sensible English sentences?
Thanks

A simple Python script can create a pseudo-random document of words. Here is the one I wrote for just such a task a year ago:
import random

file1 = open("test.txt", "a")
PseudoRandomWords = ["Apple ", "Banana ", "Tree ", "Pickle ", "Toothpick ", "Coffee ", "Done "]

# Increase the range to make a bigger file
for x in range(150000000):
    # Change the end of the randint range below if you add more words
    index = random.randint(0, 6)
    file1.write(PseudoRandomWords[index])
    if x % 20 == 0:
        file1.write('\n')

file1.close()
Just add more words to the list to make it more varied, and increase the end of the randint range accordingly. I just tested it, and it should create a document named test.txt of roughly one gigabyte. It will contain words from the list in random order, separated by a newline every 20 words.
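If you want to sanity-check the result, a quick size check after the script finishes could look like this (assuming the test.txt filename from the script above):
import os

# Report the size of the generated file in gibibytes
size_gib = os.path.getsize("test.txt") / (1024 ** 3)
print("test.txt is {:.2f} GiB".format(size_gib))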

I wrote this simple Python script that scrapes the Project Gutenberg site and writes the text (encoding: us-ascii; if you want to use other encodings, see http://www.gutenberg.org/files/) to a local text file. This script can be used in combination with https://github.com/c-w/gutenberg to do more accurate filtering (by language, by author, etc.); see the sketch after the script.
from __future__ import print_function
import requests
import sys

if len(sys.argv) != 2:
    print("[---------- ERROR ----------] Usage: scraper <number_of_files>", file=sys.stderr)
    sys.exit(1)

number_of_files = int(sys.argv[1])
text_file = open("big_file.txt", 'w+')

for i in range(number_of_files):
    url = 'http://www.gutenberg.org/files/' + str(i) + '/' + str(i) + '.txt'
    resp = requests.get(url)
    if resp.status_code != 200:
        print("[X] resp.status_code =", resp.status_code, "for", url)
        continue
    print("[V] resp.status_code = 200 for", url)
    try:
        content = resp.text
        # dummy cleaning of the text: keep only what lies between the START and END markers
        splitted_content = content.split("*** START OF THIS PROJECT GUTENBERG EBOOK")
        splitted_content = splitted_content[1].split("*** END OF THIS PROJECT GUTENBERG EBOOK")
        print(splitted_content[0], file=text_file)
    except IndexError:
        # the markers were not found in this file; skip it
        continue

text_file.close()
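As a rough sketch of the filtering idea mentioned above (this assumes the c-w/gutenberg package is installed and its metadata cache has already been populated, as described in that project's README; the author name is just an example), you could select texts by author before downloading them:
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers
from gutenberg.query import get_etexts

with open("big_file.txt", "w") as out:
    # get_etexts returns the Project Gutenberg text numbers matching the metadata query
    for text_id in get_etexts("author", "Austen, Jane"):
        try:
            # load_etext downloads the book; strip_headers removes the Gutenberg boilerplate
            out.write(strip_headers(load_etext(text_id)).strip() + "\n")
        except Exception as exc:
            print("[X] skipping", text_id, exc)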

Related

Perform a transformation on a single column in Apache beam

I have a CSV that I've loaded into Google Cloud Storage, and I am creating a Dataflow pipeline that will read and process the CSV, then perform a count of listings by a single column.
How do I isolate that single column? Let's say the columns are id, city, sports_team. I want to count how many occurrences of each city show up.
My starting code is like so:
# Python's regular expression library
import re

# Beam and interactive Beam imports
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib


class SplitRecords(beam.DoFn):
    """Split the element into records, return the second (city) column."""
    def process(self, element):
        records = element.split(",")
        return [records[1]]


p = beam.Pipeline(InteractiveRunner())
lines = p | 'read in file' >> beam.io.ReadFromText("gs://ny-springml-data/AB_NYC_2019.csv", skip_header_lines=1)
records = lines | beam.ParDo(SplitRecords())
groups = (records | beam.Map(lambda x: (x, 1)) | beam.CombinePerKey(sum))
groups | beam.io.WriteToText('TEST2.txt')
I am getting an IndexError: list index out of range. I'm extremely new to all of this, so any help is appreciated.
Presumably there's some unexpected line in your CSV file, e.g. a blank one. You could do something like
if len(records) < 2:
    raise ValueError("Bad line: %r" % element)
else:
    yield records[1]
to get a better error message. I would also recommend looking into using Beam DataFrames for this kind of task.
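As a minimal sketch of the DataFrame approach (assuming the CSV has a header row with a column named city; substitute whatever your column is actually called):
import apache_beam as beam
from apache_beam.dataframe.io import read_csv

with beam.Pipeline() as p:
    # read_csv infers the columns from the CSV header row
    df = p | read_csv("gs://ny-springml-data/AB_NYC_2019.csv")
    # Count how many rows share each city value
    counts = df.groupby("city").size()
    # Write the per-city counts back out
    counts.to_csv("gs://ny-springml-data/city_counts")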

Running couple of tests in robot framework infinitely

How can I run a couple of tests in Robot Framework infinitely, or at least a large (finite) number of times?
E.g.:
Test case 1
...
Test case 2
...
Test case 3
...
I want the tests to run in the order 1, 2, 3, 1, 2, 3, ... either finitely (for a large number of iterations) or infinitely.
I know how to do this for a single test, but I want it to come back and run test 1 after test 3, and I want this batch to run in a loop.
It is not possible to create an infinite loop within RF that will run the current file over and over again indefinitely. Instead, you could create a script which points to the RF file and handles the infiniteness for you; when needed, you kill the process and join all the output.xml files together, creating the mother of all RF reports. Here is a quick example in Python:
import subprocess
import os
import glob

try:
    while True:
        # Add any robot options you may want, e.g. --timestampoutputs so each
        # run writes a uniquely named output file instead of overwriting output.xml
        subprocess.call(["robot", "EnterFileNameHere.robot"])
except KeyboardInterrupt:
    total = []
    os.chdir("/DirectoryWhich/HasAll/TheXML/Files")
    for grabbed_file in glob.glob("*.xml"):
        total.append(grabbed_file)
    # Add any rebot options you may want
    subprocess.call(["rebot"] + total)
Change the directories to match where your files are, and this will fire off your robot file of choice indefinitely, constantly creating output/report files. Once you kill it (with Ctrl+C), the script catches the KeyboardInterrupt, merges all of the output files for you, and then exits.
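If you would rather not rely on timestamped outputs, a minimal variation of the loop above (assuming Robot Framework's standard --output option) gives every run an explicitly numbered output file, so rebot later has distinct files to merge:
import itertools
import subprocess

try:
    # Each run writes its own output_<n>.xml instead of overwriting output.xml
    for run in itertools.count():
        subprocess.call(["robot", "--output", "output_{0}.xml".format(run),
                         "EnterFileNameHere.robot"])
except KeyboardInterrupt:
    pass  # fall through to a rebot merge step as in the script above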
The only other way to do this within RF itself is the approach in this answer here, but that would only generate a report for you once the loop has completed. I do not know how it would handle report generation if you suddenly killed RF; I presume it wouldn't create any reports at all. So personally, I think this is your best bet.
Any questions let me know.

Tensorflow: How to convert .meta, .data and .index model files into one graph.pb file

In TensorFlow, training from scratch produced the following six files:
events.out.tfevents.1503494436.06L7-BRM738
model.ckpt-22480.meta
checkpoint
model.ckpt-22480.data-00000-of-00001
model.ckpt-22480.index
graph.pbtxt
I would like to convert them (or only the needed ones) into a single graph.pb file so I can transfer it to my Android application.
I tried the freeze_graph.py script, but it already requires an input .pb file, which I do not have (I have only the six files mentioned above). How do I proceed to get this single frozen_graph.pb file? I saw several threads, but none of them worked for me.
You can use this simple script to do that. But you must specify the names of the output nodes.
import tensorflow as tf

meta_path = 'model.ckpt-22480.meta'  # Your .meta file
output_node_names = ['output']       # Output node names (without the ':0' tensor suffix)

with tf.Session() as sess:
    # Restore the graph
    saver = tf.train.import_meta_graph(meta_path)

    # Load weights from the latest checkpoint in your checkpoint directory
    saver.restore(sess, tf.train.latest_checkpoint('path/of/your/checkpoint/dir'))

    # Freeze the graph
    frozen_graph_def = tf.graph_util.convert_variables_to_constants(
        sess,
        sess.graph_def,
        output_node_names)

    # Save the frozen graph
    with open('output_graph.pb', 'wb') as f:
        f.write(frozen_graph_def.SerializeToString())
If you don't know the name of the output node or nodes, there are two ways:
You can explore the graph and find the name with Netron or with the summarize_graph command-line utility.
You can use all the nodes as output ones as shown below.
output_node_names = [n.name for n in tf.get_default_graph().as_graph_def().node]
(Note that you have to put this line just before the convert_variables_to_constants call.)
But I think that's an unusual situation, because if you don't know the output node, you cannot really use the graph anyway.
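If it helps, here is a rough, hypothetical helper (not part of the original answer): once the meta graph has been imported inside a session as in the script above, nodes that no other node consumes are usually good output-node candidates.
graph_def = tf.get_default_graph().as_graph_def()

# Collect every node name that appears as an input to some other node
consumed = set()
for node in graph_def.node:
    for inp in node.input:
        # Inputs may look like 'name', 'name:1', or '^name' (control dependency)
        consumed.add(inp.split(":")[0].lstrip("^"))

# Nodes that are never consumed are likely output-node candidates
print([node.name for node in graph_def.node if node.name not in consumed])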
As it may be helpful to others, I am also answering here after answering on GitHub ;-).
I think you can try something like this (with the freeze_graph script in tensorflow/python/tools):
python freeze_graph.py \
  --input_graph=/path/to/graph.pbtxt \
  --input_checkpoint=/path/to/model.ckpt-22480 \
  --input_binary=false \
  --output_graph=/path/to/frozen_graph.pb \
  --output_node_names="the nodes that you want to output, e.g. InceptionV3/Predictions/Reshape_1 for Inception V3"
The important flag here is --input_binary=false, as graph.pbtxt is in text format; I think it corresponds to the required graph.pb, which is the equivalent in binary format.
Concerning the output_node_names, that's still a bit confusing for me, as I have some problems with this part, but you can use the summarize_graph script in TensorFlow, which can take the .pb or the .pbtxt as input.
Regards,
Steph
I tried the freeze_graph.py script, but the output_node_names parameter was totally confusing and the job failed.
So I tried the other one: export_inference_graph.py.
And it worked as expected!
python -u /tfPath/models/object_detection/export_inference_graph.py \
--input_type=image_tensor \
--pipeline_config_path=/your/config/path/ssd_mobilenet_v1_pets.config \
--trained_checkpoint_prefix=/your/checkpoint/path/model.ckpt-50000 \
--output_directory=/output/path
The TensorFlow models package I used is from here:
https://github.com/tensorflow/models
First, use the following code to generate the graph.pb file.
import tensorflow as tf

with tf.Session() as sess:
    # Restore the graph from the .meta file (args comes from the caller's argument parser)
    _ = tf.train.import_meta_graph(args.input)

    # Save the graph definition to a text protobuf file (graph.pbtxt)
    g = sess.graph
    gdef = g.as_graph_def()
    tf.train.write_graph(gdef, ".", args.output, True)
Then, use summarize_graph to get the output node name.
Finally, use
python freeze_graph.py \
  --input_graph=/path/to/graph.pbtxt \
  --input_checkpoint=/path/to/model.ckpt-22480 \
  --input_binary=false \
  --output_graph=/path/to/frozen_graph.pb \
  --output_node_names="the nodes that you want to output, e.g. InceptionV3/Predictions/Reshape_1 for Inception V3"
to generate the frozen graph.

requests.get() not fetching all Common Crawl records for a given WARC path

I have implemented https://dmorgan.info/posts/common-crawl-python/ as described in that link. However, I want to fetch the entire data rather than partial data, unlike what is described in the post. So, in this code chunk,
def get_partial_warc_file(url, num_bytes=1024 * 10):
    with closing(requests.get(url, stream=True)) as r:
        buf = StringIO(r.raw.read(num_bytes))
    return warc.WARCFile(fileobj=buf, compress=True)
I have made the following change:
def get_partial_warc_file(url):
    with closing(requests.get(url, stream=True)) as r:
        buf = StringIO(r.raw.data)
    return warc.WARCFile(fileobj=buf, compress=True)
This change increases the number of records fetched for a given WARC path, but it still does not fetch the entire set of records. I can't find a possible reason for this. Any help would be appreciated.

How to find a common variable in a large number of databases using Stata

So I have a large number of databases (82) in Stata, each containing around 1,300 variables and several thousand observations. Some of these databases contain variables that give the mean or standard deviation of certain concepts. For example, a variable in such a dataset could be called "leverage_mean". Now, I want to know which datasets contain variables called concept_mean or concept_sd, without having to go through every dataset by hand.
I was thinking that maybe there is a way to loop through the databases looking for variables containing "mean" or "sd", but unfortunately I have no idea how to do this. I'm using R and Stata data files.
Yes, you can do this with a loop in Stata as well as in R. First, you should check out the Stata command ds and the user-written package findname, which will do many of the things described here and much more. But to show you what is happening "under the hood", here is Stata code that can achieve this:
/*Set your current directory to the location of your databases*/
cd "[your cd here]"
Save the names of the 82 databases to a local macro called "filelist" using Stata's dir extended macro function. NOTE: you don't specify what kind of files your databases are, so I'm assuming .xls. This command saves all files with the extension ".xls" into the list. What type of file you save into the list, and how you import your databases, will depend on what type of files you are reading in.
local filelist : dir . files "*.xls"
Then loop over all files to show which ones contain variables that end with "_sd" or "_mean".
foreach file of local filelist {
    /*import the data*/
    import excel "`file'", firstrow clear case(lower)
    /*produce a list of the variables that end with "_sd" and "_mean"*/
    cap quietly describe *_sd *_mean, varlist
    /*note the macro quotes: we test the contents of r(varlist), not the literal text "r(varlist)"*/
    if length("`r(varlist)'") > 0 {
        /*If the database contains variables of interest, display the database file name and variables on screen*/
        display "Database `file' contains variables: `r(varlist)'"
    }
}
Final note: this loop will only display the database name and the variables of interest contained within it. If you want to perform actions on the data, or do anything else, those actions need to go where the final display command is (which you may or may not ultimately need).
You can use filelist (from SSC) to create a dataset of files. To install filelist, type in Stata's Command window:
ssc install filelist
With a list of datasets in memory, you can then loop over each file and use describe to get a list of variables for each file. You can store this list of variables in a single string variable. For example, the following will collect the names of all Stata datasets shipped with Stata and then, for each one, store the variables it contains:
findfile "auto.dta"
local base_dir = subinstr("`r(fn)'", "/a/auto.dta", "", 1)
dis "`base_dir'"
filelist, dir("`base_dir'") pattern("*.dta")
gen variables = ""
local nmatch = _N
qui forvalues i = 1/`nmatch' {
    local f = dirname[`i'] + "/" + filename[`i']
    describe using "`f'", varlist
    replace variables = " `r(varlist)' " in `i'
}
leftalign // also from SSC; to install: ssc install leftalign
Once you have all this information in the data in memory, you can easily search for specific variables. For example:
. list filename if strpos(variables, " rep78 ")
+-----------+
| filename |
|-----------|
13. | auto.dta |
14. | auto2.dta |
+-----------+
The lookfor_all package (SSC) is there for that purpose:
cd "pathtodirectory"
lookfor_all leverage_mean
Just make sure the file extensions are in lowercase (.dta) and not uppercase.
