Extracting a subset of Freebase data for faster development iteration - freebase

I have downloaded the 250G dump of Freebase data. I don't want to iterate on my development against the full data set. I want to extract a small subset of the data (maybe a small domain, or some 10 personalities and their information). This small subset will make my iterations faster and easier.
What's the best approach to partitioning the Freebase data?
Is there any subset download provided by Google/Freebase?

This is feedback that we've gotten from many people using the data dumps. We're looking into how best to create such subsets. One approach would be to get all the data for a single domain like Film.
Here's how you'd get every RDF triple from the /film domain:
zgrep '\s<http://rdf\.freebase\.com/ns/film.' freebase-rdf-{date}.gz | gzip > freebase-films.gz
The tricky part is that this subset won't contain the names, images, or descriptions, which you most likely also want. So you'll need to get those like this:
zgrep '\s<http://rdf\.freebase\.com/ns/(type\.object|common\.topic)' freebase-rdf-{date}.gz | gzip > freebase-topics.gz
Then you'll possibly want to filter that subset down to only topic data about films (match only triples that start with the same /m ID) and concatenate that to the film subset.
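A minimal sketch of that step, assuming the dump is whitespace-separated N-Triples with the subject URI in the first column, and using the freebase-films.gz and freebase-topics.gz files produced above:
# collect the distinct subject IDs (/m...) that occur in the film subset
zcat freebase-films.gz | awk '{ print $1 }' | sort -u > film-subjects.txt
# keep only the topic triples whose subject is one of those IDs
awk 'NR == FNR { keep[$1] = 1; next } keep[$1]' film-subjects.txt \
    <(zcat freebase-topics.gz) | gzip > freebase-film-topics.gz
# concatenated gzip streams are themselves a valid gzip stream
cat freebase-films.gz freebase-film-topics.gz > freebase-film-subset.gz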
It's all pretty straightforward to script with regular expressions, but it's a lot more work than it should be. We're working on a better long-term solution.

I wanted to do a similar thing and I came up with the following command line.
gunzip -c freebase-rdf-{date}.gz | awk 'BEGIN { prev_1 = "" } { if (prev_1 != $1) { print "" } print $0; prev_1 = $1 }' | awk 'BEGIN { RS = "" } $0 ~ /type\.object\.type.*\/film\.film>/' > freebase-films.txt
It will give you all the triples for every subject that has the type film.film (it assumes the dump is sorted so that all triples for a subject are adjacent).
After this you can simply grep for the predicates that you need.
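For example (the predicate here is only an illustration; substitute whichever properties you need):
grep '<http://rdf\.freebase\.com/ns/film\.film\.directed_by>' freebase-films.txt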

Just one remark on the accepted post: the topics variant didn't work for me, because to use the alternation in that regex we need to pass the -E parameter:
zgrep -E '\s<http://rdf\.freebase\.com/ns/(type\.object|common\.topic)' freebase-rdf-{date}.gz | gzip > freebase-topics.gz

Related

Bash/R searching columns in a huge table

I have a huge table I want to extract information from. Firstly, I want to extract a certain line based on a pattern; I've done that successfully with grep. However, this line has loads of columns and I'm interested only in a couple of them that have a certain pattern in them (a partial match at the beginning of the string). Is it possible to extract only those columns, along with their column numbers (the nth column), for such partial matches? Hope I was clear enough.
Languages: Preferably in bash but I can also work in R, alternatively I'm open to suggestions if you think another language can be more helpful.
Thanks!
Awk is perfect for stuff like this. To help you write a script I think we need more details, but I'm guessing you'll want to use the print feature of awk. To print out the nth column of a file "your_file" (with n replaced by the actual column number), do:
awk '{print $n}' your_file
In solving your problem you may also want to loop over all N columns which you can do via:
for i in {1..N} ;
do
awk -v col=${i} '{print $col}' your_file ;
done
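For the second part of the question (printing only the columns that start with a given pattern, together with their column numbers), here is a small sketch; the pattern "^foo" and the file name your_file are placeholders:
awk -v pat="^foo" '{ for (i = 1; i <= NF; i++) if ($i ~ pat) print i, $i }' your_file
This loops over every field of each line and, for each field matching the pattern, prints the field number followed by the field value.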

nested for loop too slow: 1MN record traversal

I have a huge record count, around 200,000 records in a file. I have been testing some cases where I have to figure out whether the naming pattern of the files matches some specific strings. Here's how I proceeded:
I stored the test strings in a file (let's say, for one case, there are 10 of them). The actual file contains the string records, separated by newlines, totaling up to 200,000 records. To check whether the test string patterns are present in the large file, I wrote a small nested for loop.
for i in `cat TestString.txt`
do
    for j in `cat LargeFile.txt`
    do
        if [[ $i == $j ]]
        then
            echo "Match" >> result.txt
        fi
    done
done
This nested loop actually has to do the traversal (if I'm not wrong about the concept) 10 x 200,000 times. Normally I wouldn't consider that too much of a load on the server, but the time it takes is enormous: the script has been running for the past 4 hours, with of course some "Match" results.
Does anyone have any idea how to speed this up? I've found plenty of answers with a Python or Perl touch, but I'm honestly looking for something in plain Unix.
Thanks
Try the following:
grep -f TestString.txt LargeFile.txt >> result.txt
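Since the original loop compares whole lines with ==, a closer equivalent is exact, fixed-string, whole-line matching (a suggestion; adjust to taste):
grep -Fxf TestString.txt LargeFile.txt >> result.txt
-F treats the patterns as fixed strings, -x matches whole lines only, and -f reads the patterns from TestString.txt.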
Check out grep
while read line
do
cat LargeFile.txt | grep "$line" >> result.txt
done < TestString.txt
grep will output any matching strings. This may be faster. Note that your TestString.txt file should not have any blank lines or grep will return everything from LargeFile.txt.

Facing issue while invoking grep inside awk command

I am looking to extract some information from a log using awk, and based on the information returned I want to grep the whole file and write all the output from grep and awk to a file. I was able to extract some information with awk, but while using grep inside awk I am not able to extract the information. Please find the logs below.
2014-04-10 13:55:59,837 [WebContainer : 4] [com.cisco.ata.service.AtAService] WARN - AtAService::AtAServiceRequest DetailMessage - module=ataservice;service=ataservicerequest;APP_ID=CDCSDSATAUser.gen;VIEW_NAME=/EntitlementView[CCOID="frhocevar"]REQUEST_ID_STRING=-105411838, took 100 milliseconds.
Based on the REQUEST_ID_STRING, I have to get the useCaseID.
2014-04-10 13:55:59,800 [Thread-66] [com.cisco.ata.cla.CLAManager] INFO - CLAManager.getAttributeFromCLAMapping() took 6 ms, for useCaseID - UC41, condition= (CCOID=frhocevar), requestID= -105411838
I am extracting the REQUEST_ID_STRING using awk, but I am not able to extract the "useCaseID" using grep.
Below is the command I am using.
grep -i -r 'AtAService::AtAServiceRequest DetailMessage - module=ataservice;service=ataservicerequest' /opt/httpd/logs/apps/atasvc/prod1/was70/*/*.log* |
awk 'BEGIN{count=0;}{if($14>1000){print $0}}' |
awk 'BEGIN{ FS=";"}
{a = substr($3,8)}
{b = substr($4,index($4,"/")+1,index($4,"]R")-index($4,"/"))}
{c = substr($4,index($4,"G=")+2,index($4,", took")-index($4,"G=")-2);}
{d = substr($1,0,index($1,":")-1)}
{e=grep command which will extract usecaseid from $d having file name}
{ print a","b","c","d","e} '
Please help me with this issue.
Thanks in advance
I'm incredibly tired, so this is likely not the best solution, but it uses some basic "awkisms" that make for some pretty good boilerplate starting points for a lot of stuff.
AirBoxOmega:~ d$ cat log
2014-04-10 13:55:59,837 [WebContainer : 4] [com.cisco.ata.service.AtAService] WARN - AtAService::AtAServiceRequest DetailMessage - module=ataservice;service=ataservicerequest;APP_ID=CDCSDSATAUser.gen;VIEW_NAME=/EntitlementView[CCOID="frhocevar"]REQUEST_ID_STRING=-105411838, took 100 milliseconds.
2014-04-10 13:55:59,800 [Thread-66] [com.cisco.ata.cla.CLAManager] INFO - CLAManager.getAttributeFromCLAMapping() took 6 ms, for useCaseID - UC41, condition= (CCOID=frhocevar), requestID= -105411838
AirBoxOmega:~ d$ cat stackHelp.awk
{
    # a slow AtAService request: the next-to-last field is the millisecond
    # count, so remember the request ID whenever it exceeds 99
    if ($0 ~ /AtAService::AtAServiceRequest DetailMessage/ && $(NF - 1) > 99) {
        split($0, tmp, "[-,]")
        slow[tmp[7]]++        # tmp[7] holds the request ID digits
    }
    # any line whose trailing requestID (minus its leading "-") was
    # remembered above: print the request ID followed by the useCaseID
    if (slow[substr($NF,2)]) {
        split($0, tmp, "[-,]")
        print $NF tmp[8]      # tmp[8] is the useCaseID field (e.g. " UC41")
    }
}
AirBoxOmega:~ d$ gawk -f stackHelp.awk log
-105411838 UC41
This uses a pretty basic awk concept where if you know that there is something common among your log lines (a sessionID, or some such) you create an array for that based on certain conditions (in this case that the log line contains a given string and that the next to last column is > 99). Then later when you run into that same sessionID, you can check to see if an array exists for it, and if so, pull out even more info.
You may need/want to add something to the second if statement so it's only checking log lines you care about, but honestly, awk is so fast that it probably won't matter. (I'm using gawk [via brew] as the version of awk that comes with OSX is somewhat lacking, but this code is basic enough that awk or gawk should work.)
If you need a better explanation of the code, I'll try to explain better.
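Boiled down, the general pattern is just two rules (everything here, the marker strings, the field number $3, and logfile, is a made-up placeholder rather than anything from these logs):
awk '/INTERESTING EVENT/ { seen[$3] = 1 }
     seen[$3] && /FOLLOW-UP EVENT/ { print $3 }' logfile
The first rule remembers an ID whenever a line matches the first condition; the second rule prints on any line that carries a remembered ID and matches the second condition.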
Ninja edit: A few quick tips:
Don't use grep -i unless you really don't know the case you're looking for. Case insensitivity will make your searches MUCH slower.
If you're not using any kind of regular expressions, use fgrep instead of grep. It's much faster out of the box.
Learn how to ask questions efficiently. Your question was pretty clear, but use code tags to make the log lines more readable, and remember that every technical question should include:
What your input is
What your output should be
What you tried
What you expected
What you got
Get good at awk. The world is slowly moving away from command line centric stuff, and people may say it's not worth it, but once you understand basic concepts in awk, it's easy to apply them elsewhere, be it python, log utilities, or just thinking in terms of data aggregation.

Is unix join strictly lexical?

I have to process many ('00s) two-column delimited files that are numerically sorted by their first column (a long int that can range from 857 to 293823421 for example).
The processing is simple enough: iterate through a loop to left-join the files using one of them as 'anchor' (the 'left' file in the join), using join's -e and -o options to fill in the NULLs.
Question: is there any way join (from Core Utils 8.13) can process these joins as-is, or must I add a sort -k1,1 step to ensure lexical order prior to each join ?
Everything I've read searching this tells me I have to, but I wanted to make sure I wasn't missing some clever trick to avoid the extra sorting. Thank you.
Indeed, join does not support numeric comparisons. However, from your description, it sounds like you can convert your first field into an already-string-sorted form by zero-padding it, and then convert it back by de-zero-padding it. For example, here is a function that performs a join -e NULL on two files that match your description (as I understand it):
function join_by_numeric_first_field () {
    local file1="$1"
    local file2="$2"
    join -e NULL <(awk '{printf("%020d\t%s\n", $1, $2)}' "$file1") \
                 <(awk '{printf("%020d\t%s\n", $1, $2)}' "$file2") \
        | awk '{printf("%d\t%s\n", $1, $2)}'
}
(The awk '{printf("%020d\t%s\n", $1, $2)}' reads each line of a two-column input and re-prints out the two columns, separated by a tab, but treating the first column as a decimal integer and zero-padding it out to twenty characters. The awk '{printf("%d\t%s\n", $1, $2)}' does the same thing, except that it doesn't zero-pad the decimal integer, so it has the effect of stripping out any zero-padding that was there.)
Whether this is a better approach than sort-ing will depend on the size of your files, and on how flexible you need to be in supporting files that don't quite match your description. This approach scales linearly with the file-size, but is significantly more complicated, and is also a bit more fragile, in that the awk commands expect a pretty specific input-format. The sort approach is much simpler, but will not perform as well for large files.
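As a quick illustration of why the padding trick works (using two keys from the range mentioned in the question): once padded to a fixed width, the numbers sort the same way lexically as they do numerically, which is all join needs.
printf '857\tA\n293823421\tB\n' | awk '{printf("%020d\t%s\n", $1, $2)}'
# 00000000000000000857    A
# 00000000000293823421    B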

grep -f alternative for huge files

grep -F -f file1 file2
file1 is 90 Mb (2.5 million lines, one word per line)
file2 is 45 Gb
That command doesn't actually produce anything whatsoever, no matter how long I leave it running. Clearly, this is beyond grep's scope.
It seems grep can't handle that many queries from the -f option. However, the following command does produce the desired result:
head file1 > file3
grep -F -f file3 file2
I have doubts about whether sed or awk would be appropriate alternatives either, given the file sizes.
I am at a loss for alternatives... please help. Is it worth it to learn some sql commands? Is it easy? Can anyone point me in the right direction?
Try using LC_ALL=C. It switches the pattern matching from UTF-8 to plain ASCII, which sped things up to about 140 times the original speed for me. I have a 26G file that would have taken around 12 hours to process; it went down to a couple of minutes.
Source: Grepping a huge file (80GB) any way to speed it up?
So what I do is:
LC_ALL=C fgrep "pattern" <input >output
I don't think there is an easy solution.
Imagine you wrote your own program to do what you want: you would end up with a nested loop, where the outer loop iterates over the lines in file2 and the inner loop iterates over file1 (or vice versa). The number of iterations grows with size(file1) * size(file2), which becomes a very large number when both files are large. Making one file smaller using head apparently resolves this issue, at the cost of no longer giving the correct result.
A possible way out is indexing (or sorting) one of the files. If you iterate over file2 and for each word you can determine whether or not it is in the pattern file without having to fully traverse the pattern file, then you are much better off. This assumes that you do a word-by-word comparison. If the pattern file contains not only full words, but also substrings, then this will not work, because for a given word in file2 you wouldn't know what to look for in file1.
Learning SQL is certainly a good idea, because learning something is always good. It will, however, not solve your problem, because SQL will suffer from the same quadratic effect described above. It may simplify indexing, should indexing be applicable to your problem.
Your best bet is probably taking a step back and rethinking your problem.
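If your comparison really is exact whole-line (one word per line) matching, one way to apply the sorting/indexing idea with standard tools is to sort both files and intersect them (a sketch; it handles exact matches only, not substrings):
sort file1 > file1.sorted
sort file2 > file2.sorted
comm -12 file1.sorted file2.sorted > matches.txt
comm -12 suppresses the lines unique to each input and prints only the lines common to both; it requires both inputs to be sorted, which is what the two sort calls provide.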
You can try ack. It is said to be faster than grep.
You can try parallel:
parallel --progress -a file1 'grep -F {} file2'
Parallel has many other useful switches to make computations faster.
Grep can't handle that many queries, and at that volume, it won't be helped by fixing the grep -f bug that makes it so unbearably slow.
Are both file1 and file2 composed of one word per line? That means you're looking for exact matches, which we can do really quickly with awk:
awk 'NR == FNR { query[$0] = 1; next } query[$0]' file1 file2
NR (number of records, the line number) is only equal to FNR (the file-specific number of records) for the first file, where we populate the hash and then move on to the next line. The second clause checks the other file(s) for whether the line matches one saved in our hash and then prints the matching lines.
Otherwise, you'll need to iterate:
awk 'NR == FNR { query[$0]=1; next }
{ for (q in query) if (index($0, q)) { print; next } }' file1 file2
Instead of merely checking the hash, we have to loop through each query and see if it matches the current line ($0). This is much slower, but unfortunately necessary (though at least we're matching plain strings rather than regexes, which would be slower still). The loop stops as soon as we have a match.
If you actually wanted to evaluate the lines of the query file as regular expressions, you could use $0 ~ q instead of the faster index($0, q). Note that this uses POSIX extended regular expressions, roughly the same as grep -E or egrep, but without bounded quantifiers ({1,7}) or the GNU extensions for word boundaries (\b) and shorthand character classes (\s, \w, etc.).
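For completeness, here is what that regex variant would look like (same structure as the loop above; only the comparison changes):
awk 'NR == FNR { query[$0] = 1; next }
     { for (q in query) if ($0 ~ q) { print; next } }' file1 file2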
These should work as long as the hash doesn't exceed what awk can store. This might be as low as 2.1B entries (a guess based on the highest 32-bit signed int) or as high as your free memory.
