How to map mid of notable types of freebase to the type name? - freebase

I want to use freebase dump to query notable types. But I can only get the machine id of types like "m.0kpv11". Is there a way to map it to real name?

The Freebase RDF dumps include a significant amount of redundancy, so there are usually display names, in many languages, near the ID, e.g.
<http://rdf.freebase.com/ns/g.11b764z9c9> <http://rdf.freebase.com/ns/common.notable_for.display_name> "Musical Recording"@en .
<http://rdf.freebase.com/ns/g.11b764z9c9> <http://rdf.freebase.com/ns/common.notable_for.display_name> "Musikalspår"@sv .
<http://rdf.freebase.com/ns/g.11b764z9c9> <http://rdf.freebase.com/ns/common.notable_for.object> <http://rdf.freebase.com/ns/m.0kpv11> .
But anything used as an object (the third column of a triple) will also have a set of triples with the same ID as the subject (i.e. the first column), so you can look for:
<http://rdf.freebase.com/ns/m.0kpv11> <http://rdf.freebase.com/ns/type.object.name> "Musical Recording"@en .
<http://rdf.freebase.com/ns/m.0kpv11> <http://www.w3.org/2000/01/rdf-schema#label> "Musical Recording"@en .
In addition to the @en tag for English labels, there are labels in a total of 44 languages (for this example, anyway; it can vary). Some examples:
<http://rdf.freebase.com/ns/m.0kpv11> <http://rdf.freebase.com/ns/type.object.name> "Pista musical"@es .
<http://rdf.freebase.com/ns/m.0kpv11> <http://rdf.freebase.com/ns/type.object.name> "Canción"@es-419 .
<http://rdf.freebase.com/ns/m.0kpv11> <http://rdf.freebase.com/ns/type.object.name> "Muusikapala"@et .
<http://rdf.freebase.com/ns/m.0kpv11> <http://rdf.freebase.com/ns/type.object.name> "Μουσικό κομμάτι"@el .
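The lookup described above could be sketched in Python roughly as follows. This is only an illustration: it assumes the dump lines use standard N-Triples language tags of the form "..."@en, and the names NAME_TRIPLE, names_for_mids and SAMPLE are made up for this example. For the real (gzipped, very large) dump you would stream the lines from the file instead of using an in-memory list.

```python
import re

# Matches one triple whose predicate is type.object.name and captures the
# subject mid, the literal text, and the language tag.
# Assumption: literals are written as "text"@lang (standard N-Triples).
NAME_TRIPLE = re.compile(
    r'<http://rdf\.freebase\.com/ns/(?P<mid>[mg]\.[^>]+)>\s+'
    r'<http://rdf\.freebase\.com/ns/type\.object\.name>\s+'
    r'"(?P<name>(?:[^"\\]|\\.)*)"@(?P<lang>[A-Za-z0-9-]+)\s*\.'
)


def names_for_mids(lines, wanted_mids, lang="en"):
    """Return {mid: name} for the wanted mids in the requested language."""
    found = {}
    for line in lines:
        match = NAME_TRIPLE.match(line)
        if match and match["lang"] == lang and match["mid"] in wanted_mids:
            found[match["mid"]] = match["name"]
    return found


# A few lines in the shape shown above (hypothetical in-memory sample).
SAMPLE = [
    '<http://rdf.freebase.com/ns/g.11b764z9c9> <http://rdf.freebase.com/ns/common.notable_for.object> <http://rdf.freebase.com/ns/m.0kpv11> .',
    '<http://rdf.freebase.com/ns/m.0kpv11> <http://rdf.freebase.com/ns/type.object.name> "Musical Recording"@en .',
    '<http://rdf.freebase.com/ns/m.0kpv11> <http://rdf.freebase.com/ns/type.object.name> "Pista musical"@es .',
    '<http://rdf.freebase.com/ns/m.0kpv11> <http://rdf.freebase.com/ns/type.object.name> "Canción"@es-419 .',
]

print(names_for_mids(SAMPLE, {"m.0kpv11"}))
```

Collecting all the notable-type mids first and then making a single pass over the dump is likely to be much cheaper than searching the file once per mid.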

Extracting all different options of references from pdf document in R with regex (multiple options/capture groups?)

I am trying to clean some pdf documents for text analysis. I am trying to grab all the references on the text and remove them. My problem is, that there are so many options to cite...
My documents are split up into single lines.
I have a working regex, that only captures the standard format
a) Author (year), something .
"Author, firstname, someone, else (1996), something: Analysis, Paris.\r"
I want option a,
b) Author (year(character)), something .
"Author, firstname, someone, else (1996a), something: Analysis, Paris.\r"
c) Author (forthcoming), something .
"Author, firstname, someone, else (forthcoming), something: Analysis, Paris.\r"
d) Author/s (eds.) (year), ....
"Author, firstname, someone, else (eds.) (1996), something: Analysis, Paris.\r"
e) Author (n.d.), ....
"Author, firstname, someone, else (n.d.), something: Analysis, Paris.\r"
I have found all of those in my documents... There might be options I have not found yet, so if you have examples or something that grabs those as well, I'm grateful for every bit of help.
The working code is the following:
[ ]*[A-Z].*\([0-9]{4}\),[[:space:]][“A-Z]
My latest try is this:
[ ]*[A-Z].*(\([a-z]{3,4}\.?\))?(\([0-9]{4}[a-z]?\))?(\(forthcoming\))?,[[:space:]][“A-Z]
I tried to make as many pieces optional as I could, but now it grabs too much.
I expect a list of all the References the regex finds, if possible with all the options. At the moment it grabs not enough (first case) or too much (second case).
You almost got the three optional pieces right, but since you made them all optional, the expression matches even if none of them is present. It is better to use the alternation operator |, which requires one of the subexpressions to match, i.e. instead of X?Y?Z? write (X|Y|Z); this gives:
[ ]*[A-Z].*(\([.a-z]{3,4}\.?\)|\([0-9]{4}[a-z]?\)|\(forthcoming\)),[[:space:]][“A-Z]
(Note that I changed the first [a-z] to [.a-z] in order to also cover the (n.d.) case.)
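To see the difference between the two patterns, here is a rough check in Python rather than R. Two things are assumptions of this sketch: Python's re module has no POSIX classes, so [[:space:]] is written as \s, and the test lines are invented (with a capitalised title after the comma, since both patterns require [“A-Z] there).

```python
import re

# The all-optional attempt from the question ([[:space:]] written as \s).
LOOSE = re.compile(
    r'[ ]*[A-Z].*(\([a-z]{3,4}\.?\))?(\([0-9]{4}[a-z]?\))?(\(forthcoming\))?,\s[“A-Z]'
)

# The alternation version: exactly one bracketed marker is required.
STRICT = re.compile(
    r'[ ]*[A-Z].*(\([.a-z]{3,4}\.?\)|\([0-9]{4}[a-z]?\)|\(forthcoming\)),\s[“A-Z]'
)

# Invented reference lines covering options a) to e).
REFERENCES = [
    "Smith, John (1996), Something: Analysis, Paris.",
    "Smith, John (1996a), Something: Analysis, Paris.",
    "Smith, John (forthcoming), Something: Analysis, Paris.",
    "Smith, John (eds.) (1996), Something: Analysis, Paris.",
    "Smith, John (n.d.), Something: Analysis, Paris.",
]

# An ordinary sentence that merely contains ", " followed by a capital letter.
NOT_A_REFERENCE = "In this chapter, However, nothing is cited."

for line in REFERENCES + [NOT_A_REFERENCE]:
    print(bool(LOOSE.search(line)), bool(STRICT.search(line)), line)
```

The loose pattern accepts the ordinary sentence because every bracketed group can match the empty string, while the alternation version rejects it and still accepts all five reference styles.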

Biztalk Flat File--Ignoring Headers and Footers

I believe I have a general understanding of the steps needed to do this, but am struggling to get the schemas correct, either using the Flat File Schema Generator or tweaking the config afterwards.
I will give a sample of the data below, but in general, it starts with a multi-line header that can have variable text but always ends with the same exact line ("START-OF-DATA"). The next section consists of rows of delimited data (this is the only part of the file I need to bring into BizTalk). Finally, there is a multi-line footer that always has the same start and end lines ("END-OF-DATA" and "END-OF-FILE").
Sample--my comments are in parens:
START-OF-FILE (this is always here)
(. . . variable number of lines that contain info I don't need . . .)
START-OF-DATA (this is always here)
(many lines of delimited data that I DO need)
END-OF-DATA (this is always here)
(. . . variable number of lines that contain info I don't need . . .)
END-OF-FILE (this is always here)
I have used the Flat File Schema Generator to create three schemas (header/detail/footer) with the intent to map only the detail. I created a pipeline and assigned the three schemas to the disassembly stage.
I am looking for general tips on what may be wrong with my approach, or what I should be looking out for. However the error I get when running this is:
The trailer specification specified on the pipeline component
properties does not contain an interchange trailer.
I have googled this error and (as suggested) tried to change the Child order from Infix to Postfix, but this didn't help.
I think this blog should help you:
http://maddcoder.wordpress.com/2012/06/14/using-biztalk-to-parse-a-flatfile-with-multi-line-header-and-trailers/

Dictionary File Structure of Open Spell-Checkers

Are there any explanatory docs or tutorials on the file structure of FreeDict, Aspell, and Hunspell/OpenOffice dictionaries, especially concerning the switches at the end of each row in each .dic file? My guess is that the switches describe the semantic interpretation of the word, i.e. whether it's a
noun
adjective
adverb
adverbial
etc.
or any combination of the above. But I don't know how to match these to the switch characters.
I'm also curious about what the .aff file describes.
This looks like a good starting point, and the downloads at this page may have the format documentation you're looking for.
Just a couple of links that might help you:
This one is on Stack Overflow:
What's the format of the OpenOffice dictionaries?
this second one is a good start
http://sourceforge.net/apps/mediawiki/freedict/index.php?title=Main_Page
hope this helps
In Hunspell the tags you choose are arbitrary; they have no meaning other than that which you assign to them. You can use letters, numbers (1-65535) and more.
The affix file describes many things, but is mainly concerned with how words are inflected.
For example:
$ test.dic
4
apple/a
banana/a
green/b
small/b
$ test.aff
SFX a Y 2 # Allow the following 2 suffixes to words with the "a" flag.
SFX a 0 s . # An "s" at the end for words ending in any letter (signified by the dot). "Apples" and "bananas".
SFX a 0 s' . # "Apples'" and "bananas'".
SFX b Y 2
SFX b 0 er . # "Greener" and "smaller".
SFX b 0 est . # "Greenest" and "smallest".
The manual explains most of the things in detail. There are also test files one can look at.
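As a rough illustration of how the flags in test.dic tie into the SFX rules in test.aff, the toy Python sketch below expands each dictionary entry into its suffixed forms. It is not how Hunspell itself is implemented: it only understands rules with strip "0" and condition ".", assumes single-character flags, and the function names are invented for this example.

```python
DIC = """4
apple/a
banana/a
green/b
small/b
"""

AFF = """SFX a Y 2
SFX a 0 s .
SFX a 0 s' .
SFX b Y 2
SFX b 0 er .
SFX b 0 est .
"""


def parse_suffix_rules(aff_text):
    """Collect {flag: [suffix, ...]} from SFX rule lines."""
    rules = {}
    for raw in aff_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop trailing comments
        # Rule lines have five fields: SFX <flag> <strip> <add> <condition>.
        # Header lines (SFX <flag> Y <count>) have four and are skipped.
        if len(fields) == 5 and fields[0] == "SFX":
            _, flag, strip, add, condition = fields
            if strip == "0" and condition == ".":  # only the simplest case
                rules.setdefault(flag, []).append(add)
    return rules


def expand(dic_text, rules):
    """Return every base word followed by the forms its flags allow."""
    forms = []
    for entry in dic_text.splitlines()[1:]:  # first line is the entry count
        word, _, flags = entry.strip().partition("/")
        if not word:
            continue
        forms.append(word)
        for flag in flags:  # assumes one character per flag
            forms.extend(word + suffix for suffix in rules.get(flag, []))
    return forms


print(expand(DIC, parse_suffix_rules(AFF)))
```

The flag after the slash is nothing more than a key into the rule table, which is why its meaning is whatever the affix file assigns to it.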

drupal input filter changes only one pattern

My custom Drupal module provides a custom input filter; the function is below:
function my_custom_filter($text) {
return preg_replace('~<img(.*)src=\"/sites/default/files/(.*)\"~', '<img$1src="' . variable_get('static_url', "http://fileserver.com") ."/". file_directory_path() . "/" . '$2' . "\"", $text);
}
As you can see, I use the module to point images entered in the RTE (I use TinyMCE) at a CDN file server.
The problem is, my filter only changes the last pattern of the given text. I don't understand why this happens, any ideas?
I think the problem is that your .* is being too greedy, and selecting most of the text - from the first '<img' to the last 'src='
Try adding the PCRE pattern modifier U (that's a capital U) after the closing ~ delimiter of the pattern. That inverts the greediness of the .* parts of the pattern, so they match as few characters as possible.
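The effect is easy to reproduce outside Drupal. The sketch below uses Python instead of PHP, so a few things are assumptions: Python's re module has no U modifier, so the lazy quantifier .*? is used as the per-quantifier equivalent, the sample HTML is invented, and the CDN prefix assumes file_directory_path() returns sites/default/files.

```python
import re

# Invented input with two images on one line.
HTML = (
    '<p><img class="a" src="/sites/default/files/one.png"> and '
    '<img src="/sites/default/files/two.png"></p>'
)

# Assumed result of variable_get('static_url', ...) plus the files directory.
CDN = "http://fileserver.com/sites/default/files/"

# Same pattern as in the question: both .* groups are greedy.
GREEDY = re.compile(r'<img(.*)src="/sites/default/files/(.*)"')

# Same pattern with lazy quantifiers (what the U modifier does in PCRE).
LAZY = re.compile(r'<img(.*?)src="/sites/default/files/(.*?)"')


def rewrite(pattern, text):
    """Point every matched image src at the CDN prefix."""
    return pattern.sub(
        lambda m: '<img' + m.group(1) + 'src="' + CDN + m.group(2) + '"',
        text,
    )


print(rewrite(GREEDY, HTML))  # one match spanning both tags
print(rewrite(LAZY, HTML))    # one match per tag
```

With the greedy pattern the first group swallows everything from the first <img up to the last src=, so the whole line is a single match and only the final image is rewritten.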

"Variable variable" syntax

This is a question related to getting Drupal CCK fields (just in case that happens to change anything).
I have several Drupal CCK fields with similar names: they share the same name with a number at the end. I'd like to pull values from these fields (ten fields total). This is the syntax for accessing the fields' values:
$node->cck_field_1[0]['value']
$node->cck_field_2[0]['value']
$node->cck_field_3[0]['value']
…etc.
Since they're all separate fields, but they're numbered, I'd like to just loop through incrementally to write out what I need (there's a lot more to what I'm writing than just accessing these fields' data, but they're the determining factors of the rest), but I can't figure out how to insert a variable into that part of the code.
e.g., (if $i were the incremental number variable), I'd like to be able to write the following string as a variable:
'$node->cck_field_' . $i . '[0]["value"]'
I understand about using the curly brackets to create a variable name from a string, but the part I need the variable in needs to be outside of the string. e.g. this works:
${node}->cck_field_1[0]['value']
but this doesn't:
${node->cck_field_1}[0]['value']
(so I can't write ${'node->cck_field'.$i}[0]['value'] )
So how can I write this so that I can use $i in place of the number?
This should work:
$node->{'cck_field_' . $i}[0]['value']
