Getting the list of ALL topic names from Freebase - web-scraping

According to Freebase, they have 23,407,174 topics. What is the easiest way to get the UI-friendly names (essentially the 'text' attribute of the topic JSON) of all of these topics? I don't need any other meta information.

wget -O - http://download.freebase.com/datadumps/latest/freebase-simple-topic-dump.tsv.bz2 | bunzip2 | cut -f 2 > freebase-topic-names.txt
although you probably want the Freebase IDs as well so that you know what the names refer to:
wget -O - http://download.freebase.com/datadumps/latest/freebase-simple-topic-dump.tsv.bz2 | bunzip2 | cut -f 1,2
Two additional bits of postprocessing are needed:
Tabs are escaped as \t
The string \N represents a null (non-existent) name
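A hedged sketch of that postprocessing, assuming the two-column id/name output of the second command was saved to freebase-topic-ids-names.tsv (the file name is an assumption; GNU awk and sed):
awk -F'\t' '$2 != "\\N"' freebase-topic-ids-names.tsv | sed 's/\\t/ /g' > freebase-topic-names-clean.tsv
The awk step drops rows whose name is the \N null marker; the sed step turns the escaped \t sequences into plain spaces so the two-column layout stays intact.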

Take a look at the Simple Topic Dump that we provide. It's over a GB of compressed data, but it's still faster to download than trying to get all the names through the API.

Related

Can sort command be used to sort file based on multiple columns in a csv file

We have a requirement where we have a csv file with a custom delimiter '||' (double pipes). We have 40 columns in the file and the file size is approximately 400 to 500 MB.
We need to sort the file based on 2 columns, first on column 4 and then by column 17.
We found a command with which we can sort on one column, but we are not able to find one that can sort on both columns.
Since our delimiter is two characters, we are using an awk command for sorting.
Command:
awk -F \|\| '{print $4}' abc.csv | sort > output.csv
Please advise.
If your inputs are not too fancy (no newlines in the middle of a record, for instance), the sort utility can almost do what you want, but it supports only one-character field separators. So || would not work. But wait, if you do not have other | characters in your files, we could just consider | as the field separator and account for the extra empty fields:
sort -t'|' -k7 -k33 foo.csv
We sort by fields 7 (instead of 4) and then 33 (instead of 17) because of these extra empty fields. The formula that gives the new field number is simply 2*N-1 where N is the original field number.
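A quick way to see that doubling (an illustrative one-liner, not from the original answer): split on a single | and the original 4th field shows up as field 7.
echo 'a||b||c||d' | cut -d'|' -f7
# prints d, i.e. the original 4th ||-separated field (2*4-1 = 7)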
If you do have | characters inside your fields a simple solution is to substitute them all by one unused character, sort, and restore the original ||. Example with tabs:
sed 's/||/\t/g' foo.csv | sort -t$'\t' -k4 -k17 | sed 's/\t/||/g'
If tab is also used in your fields, choose any unused character instead. Form feed (\f) or the file separator control character (ASCII code 28, that is, replace the three \t with \x1c) are good candidates.
Using PROCINFO in gnu-awk you can use this solution to sort on a multi-character delimiter:
awk -F '\\|\\|' '{a[$4,$17] = $0} END {PROCINFO["sorted_in"] = "@ind_str_asc"; for (i in a) print a[i]}' file.csv
You could try the following awk code, written as per your shown attempt only. Set OFS to | (this puts | as the output field separator; if you want , (comma) etc., change the OFS value accordingly) and also print the 17th field as per your requirement in the awk program. In sort, use the 1st and 2nd fields to sort (because the 4th and 17th fields have now become the 1st and 2nd fields respectively for sort).
awk -F'\\|\\|' -v OFS='|' '{print $4,$17}' abc.csv | sort -t'|' -k1,1 -k2,2 > output.csv
The sort command works on physical lines, which may or may not be acceptable. CSV files can contain quoted fields which contain newlines, which will throw off sort (and most other Unix line-oriented utilities; it's hard to write a correct Awk script for this scenario, too).
If you need to be able to manipulate arbitrary CSV files, you should probably look to a dedicated utility, or use a scripting language with proper CSV support. For example, assume you have a file like this:
Title,Number,Arbitrary text
"He said, ""Hello""",2,"There can be
newlines and
stuff"
No problem,1,Simple undramatic single-line CSV
In case it's not obvious, CSV is fundamentally just a text file, with some restrictions on how it can be formatted. To be valid CSV, the fields of every record should be comma-separated; any literal commas or newlines in the data need to be quoted, and any literal quotes need to be doubled. There are many variations; different tools accept slightly different dialects. One common variation is TSV, which uses tabs instead of commas as delimiters.
Here is a simple Python script which sorts the above file on the second field.
import csv
import sys

with open("test.csv", "r") as csvfile:
    csvdata = csv.reader(csvfile)
    lines = [line for line in csvdata]

titles = lines.pop(0)  # comment out if you don't have a header

writer = csv.writer(sys.stdout)
writer.writerow(titles)  # comment out if you don't have a header
writer.writerows(sorted(lines, key=lambda x: x[1]))
Using sys.stdout for output is slightly unconventional; adapt to suit your needs. The Python csv library documentation is not primarily designed to be beginner-friendly, but it should not be impossible to figure out, and it's not hard to find examples of working code.
In Python, sorted() returns a copy of a list in sorted order. There is also sort() which sorts a list in-place. Both functions accept an optional keyword parameter to specify a custom sort order. To sort on the 4th and 17th fields, use
sorted(lines, key=lambda x: (x[3], x[16]))
(Python's indexing is zero-based, so [3] is the fourth element.)
To use | as a delimiter, specify delimiter='|' in the csv.reader() and csv.writer() calls. Unfortunately, Python doesn't easily let you use a multi-character delimiter, so you might have to preprocess the data to switch to a single-character delimiter which does not occur in the data, or properly quote the fields which contain the character you selected as your delimiter.
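As a hedged sketch of that preprocessing (assuming GNU sed and that the ASCII file separator character \x1c never occurs in the data):
sed 's/||/\x1c/g' abc.csv > abc-singlechar.csv
You can then read abc-singlechar.csv with csv.reader(f, delimiter='\x1c').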

How to ls both alphabetically and by date last modified

So I was trying to do some research on it, but I could not find the answer. So I know that ls -l returns all things in the folder alphabetically, whilst ls -alt returns a list of files by their modification date, though without respect to alphabetical ordering.
I tried doing ls -l -alt, and also ls -alt -l, still no luck. What is the correct way to group them together?
Edit: With example.
Say I have the following list of directories:
aalexand bam carson duong garrett hollande jjackson ksmith mkumba olandt rcs solorzan truong yoo
aalfs battiste chae echo ghamilto holly jkelly kturner mls old.2016 reichman sophia twong zbib
I want to order them by alphabet, so say aalexand comes first. However, if aalfs has been modified last, in other words changed more recently, it should appear first.
So if this were like a SQL query then we order by date last modified, group by directory name.
I am not sure what you want to do.
But, first of all: ls -l -alt is a double use of the -l parameter (take a look at man ls for more information about the parameters).
ls -l (l stands for long listing) lists one file per line with extra information (if you don't need the extra information like permissions, use -1 instead of -l). The -a includes hidden files. -t is for sorting by modified time. You cannot sort by name AND by time, except if two files would have the same name, which is not possible. Could you please explain your wish further?
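For reference, the individual orderings look like this (illustrative examples, not from the original answer):
ls -1      # one name per line, sorted alphabetically
ls -lt     # long listing, newest modification time first
ls -ltr    # long listing, oldest modification time first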
Maybe you include a short example list of files including their modified time and your desired output, maybe then I can understand.

Issue in handling pipe-delimited flat files, each field within double quotes. What can be a solution here?

I have to handle pipe delimited flat files, in which each field comes within double quotes.
sample data:
"1193919"|"false"|""|"Mr. Andrew Christopher Alman"|""|""|"Mr."
I have written many gawk commands in my scripts. Now the issue is:
issue:
Consider this row: "1193919|false||Mr. Andrew Christopher Alman"|""|"Mr."
My script is taking the above as 6 different fields
"1193919
false
[null]
Mr. Andrew Christopher Alman"
[null]
"Mr."
But the data files are sent with the intent that
"1193919|false||Mr. Andrew Christopher Alman" should be taken as one field, as surrounded by double quotes.
My thought: I was thinking to change the field separator from | to "|"
This has a few issues. The first and last fields will come out as "1193919 and Mr." respectively.
I don't want to use '["][|]["]|^["]|["]$' as the field separator, because this will increase the number of fields and my other code will have to go through a major change.
I am asking for a solution something like:
Use | as a field separator only if it is followed by " and preceded by ". But the field separator will be | and not "|"
issue 2:
"1193919""|"false"""|""|"Mr. Andrew Christopher Alman"
At the same time I want to report an error for "false""", something like /^"["]+ | ["]+["]$/ and not /^""$/
Good data should be in below format
"1193919"|"false"|""|"Mr. Andrew Christopher Alman"
You can use gawk's FPAT variable to define quoted fields:
$ gawk -v FPAT='[^|]*|"[^"]*"' '{print $1}'
and add your logic around the number of fields, etc.
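Applied to the problem row from the question, a reasonably recent gawk then treats the whole quoted run as one field (a hedged check; handling of empty fields can vary between gawk versions):
echo '"1193919|false||Mr. Andrew Christopher Alman"|""|"Mr."' | gawk -v FPAT='[^|]*|"[^"]*"' '{ print $1 }'
# prints "1193919|false||Mr. Andrew Christopher Alman" as a single field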
The main idea is to handle all irregularities before awk (because many irregular cases are possible and awk works best on regular files).
You can replace specific patterns with a unique symbol that doesn't occur within fields and then use it as a field delimiter:
sed 's/"|"/"\t"/g' file.txt |\
awk -F '\t' '{for(i = 1; i <= NF; i++){print i, $i} }'
I'd use something that is highly unlikely to occur in a text, e.g. vertical tab \v. If you are not sure about contents of the fields, then you can determine a symbol that is not present in the current chunk of data and process it with this symbol as a delimiter.
The same approach works for issue 2. If you know that some patterns are incorrect, then you can either exclude or fix them before processing, e.g. with
sed 's/\([^|"]\)"\+|/\1"|/g'
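Applying that substitution to the malformed row from issue 2 restores the expected format (a quick check, assuming GNU sed):
echo '"1193919""|"false"""|""|"Mr. Andrew Christopher Alman"' | sed 's/\([^|"]\)"\+|/\1"|/g'
# output: "1193919"|"false"|""|"Mr. Andrew Christopher Alman"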

Extracting a subset of Freebase data for faster development iteration

I have downloaded the 250G dump of Freebase data. I don't want to iterate my development on the full data. I want to extract a small subset of the data (maybe a small domain, or some 10 personalities and their information). This small subset will make my iterations faster and easier.
What's the best approach to partition the freebase data?
Is there any subset download provided by Google/Freebase?
This is feedback that we've gotten from many people using the data dumps. We're looking into how best to create such subsets. One approach would be to get all the data for a single domain like Film.
Here's how you'd get every RDF triple from the /film domain:
zgrep '\s<http://rdf\.freebase\.com/ns/film.' freebase-rdf-{date}.gz | gzip > freebase-films.gz
The tricky part is that this subset won't contain the names, images or descriptions which you most likely also want. So you'll need to get those like this:
zgrep '\s<http://rdf\.freebase\.com/ns/(type\.object|common\.topic)' freebase-rdf-{date}.gz | gzip > freebase-topics.gz
Then you'll possibly want to filter that subset down to only topic data about films (match only triples that start with the same /m ID) and concatenate that to the film subset.
It's all pretty straightforward to script this with regular expressions, but a lot more work than it should be. We're working on a better long-term solution.
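A hedged sketch of that filtering step (the file names are placeholders, and it assumes the dump is tab-separated with the subject /m ID in the first column):
zcat freebase-films.gz | cut -f 1 | sort -u > film-mids.txt
zcat freebase-topics.gz | awk 'NR==FNR { keep[$1]; next } $1 in keep' film-mids.txt - | gzip > freebase-film-topics.gz
cat freebase-films.gz freebase-film-topics.gz > freebase-film-subset.gz
Concatenating gzip files yields a valid gzip stream, so the combined subset can still be read with zcat.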
I wanted to do a similar thing and I came up with the following command line.
gunzip -c freebase-rdf-{date}.gz | awk 'BEGIN { prev_1 = "" } { if (prev_1 != $1) { print "" } print $0; prev_1 = $1 }' | awk 'BEGIN { RS = "" } $0 ~ /type\.object\.type.*\/film\.film>/' > freebase-films.txt
It will give you all the triples for all subjects that have the type film (it assumes all subjects come in sorted order).
After this you can simply grep for the predicates that you need.
One remark on the accepted answer: the variant for topics didn't work for me, because if we want to use that regex we need to pass the -E parameter:
zgrep -E '\s<http://rdf\.freebase\.com/ns/(type\.object|common\.topic)' freebase-rdf-{date}.gz | gzip > freebase-topics.gz

Need to extract data from log files and print to another file, then look for uniqueness

I have data from http access logs that I need to do the following:
Search for the pattern in all files in a specific directory
Write that data to another file
Check new file for uniqueness and remove duplicate entries
Data looks like this:
<IP address> - - [09/Sep/2012:17:35:39 +0000] "GET /api/v1/user/followers?user_id=577670686&access_token=666507ba-8e88-423b-83c6-9df44bee2c8b& HTTP/1.1" 200 172209 <snip>
I'm particularly interested in the numeric part of: user_id=577670686, which I would like to print to a new file (I haven't tried that part yet)...
I've tried to use sed, but I'm not really trying to manipulate the data, so it seems incredibly clumsy. I looked at awk, but the data isn't really column-based and the $# designations didn't work for this data (it would be in $10, right?), and I couldn't see a way to get rid of the portion of data that results from using $#. It was suggested that I use Perl, so I've looked at examples on Google, but it's so foreign to me. Any suggestions?
Use sed to extract the relevant part, then a sort | uniq pipeline to report:
$ sed -r 's/.*user_id=([0-9]+)&.*/\1/' access.log | sort | uniq -c
This will print all unique user_id values together with the total number of occurrences.
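A hedged variant that also covers the 'all files in a directory' and 'write to a new file' parts of the question (the log path and glob are assumptions; -n together with the p flag makes sed print only lines that actually contained a user_id):
sed -rn 's/.*user_id=([0-9]+)&.*/\1/p' /path/to/logs/*.log | sort -u > unique_user_ids.txt
Drop the -u and pipe through uniq -c instead if you still want the occurrence counts.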
