Solr: Autolink words in the body from a dictionary

I'm looking for a way to generate automatic links in the body of a Solr result. The linked words must come from a dictionary.
For example, given this doc:
<doc>
[...]
<str name="title">Il faut, quand on gouverne, voir les hommes tels qu’ils sont, et les choses telles qu’elles devraient être.</str>
<str name="path">citation/faut-gouverne-voir-hommes-tels-choses-telles-devraient-etre-15.php</str>
<str name="ss_field_citation_keywords">#faut#gouverne#voir#hommes#tels#choses#telles#devraient#etre#</str>
[...]
</doc>
Body to display, taken from title:
Il faut, quand on gouverne, voir les hommes tels qu’ils sont, et les choses telles qu’elles devraient être.
Words to link, taken from ss_field_citation_keywords:
#faut#gouverne#voir#hommes#tels#choses#telles#devraient#etre#
The body must be displayed like this, with each keyword rendered as a link:
Il faut, quand on gouverne, voir les hommes tels qu’ils sont, et les choses telles qu’elles devraient être.
Do you have any idea?

There are two phases here:
Identify the keywords. For this you need to build your analyzer chain properly: a whitespace tokenizer, a lowercase filter and (this is the key part) KeepWordFilterFactory. This makes Solr keep only your keywords, along with their offsets in the text.
Get those offsets. There may be several ways, but one of them is to reuse the Field Analyzer, which you can experiment with in the admin web UI of recent (4+) Solr. Make sure to check the Verbose box. It uses the /analysis/field endpoint, which you can call too (with the verbose flag). The result is probably more verbose than you need, but it is good enough to start with. From there you can look for a better implementation, or copy and trim down the existing one.
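The two phases can also be illustrated outside Solr. The following is only a sketch in plain Java, not Solr's API: the class name KeywordLinker, its methods and the base URL are hypothetical, and the word splitting is a simple regex rather than a real analyzer chain. It keeps only the words found in the keyword list, records their offsets, and uses those offsets to wrap each word in a link.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordLinker {
    // A run of letters or digits counts as one word (Unicode-aware).
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    // Lowercase and strip accents so that "être" matches the keyword "etre".
    public static String normalize(String word) {
        String decomposed = Normalizer.normalize(word, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    // Parse a value such as "#faut#gouverne#" into a set of keywords.
    public static Set<String> parseKeywords(String hashSeparated) {
        Set<String> keywords = new HashSet<>();
        for (String keyword : hashSeparated.split("#")) {
            if (!keyword.isEmpty()) {
                keywords.add(keyword);
            }
        }
        return keywords;
    }

    // Phases 1 and 2: keep only the keywords and return their [start, end) offsets.
    public static List<int[]> keywordOffsets(String text, Set<String> keywords) {
        List<int[]> offsets = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            if (keywords.contains(normalize(matcher.group()))) {
                offsets.add(new int[] { matcher.start(), matcher.end() });
            }
        }
        return offsets;
    }

    // Rebuild the text, wrapping every keyword span in a link.
    public static String link(String text, Set<String> keywords, String baseUrl) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int[] span : keywordOffsets(text, keywords)) {
            String word = text.substring(span[0], span[1]);
            out.append(text, last, span[0])
               .append("<a href='").append(baseUrl).append(normalize(word)).append("'>")
               .append(word).append("</a>");
            last = span[1];
        }
        return out.append(text.substring(last)).toString();
    }
}
```

Because the links are built from offsets in a single pass, text that has already been linked is never scanned again, which avoids replacing a word inside a previously generated href.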

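A proposal for doing the processing internally, with Velocity and a Java class:
// StringUtils is assumed to come from Apache Commons Lang; adjust the import to the version on your classpath.
import java.io.IOException;
import java.io.Writer;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.lang.StringUtils;
import org.apache.velocity.context.InternalContextAdapter;
import org.apache.velocity.exception.MethodInvocationException;
import org.apache.velocity.exception.ParseErrorException;
import org.apache.velocity.exception.ResourceNotFoundException;
import org.apache.velocity.runtime.directive.Directive;
import org.apache.velocity.runtime.parser.node.Node;
public class autoLinkCitationDirective extends Directive{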
public String getName() {
return "autolinkcitation";
}
public int getType() {
return LINE;
}
public boolean render(InternalContextAdapter context, Writer writer, Node node)
throws IOException, ResourceNotFoundException, ParseErrorException, MethodInvocationException {
String CitationMe = null;
String KeyWords = null;
String SchemaUrl = null;
//params
if (node.jjtGetChild(0) != null) {
CitationMe = String.valueOf(node.jjtGetChild(0).value(context));
}
if (node.jjtGetChild(1) != null) {
KeyWords = String.valueOf(node.jjtGetChild(1).value(context));
}
//schema url
if (node.jjtGetChild(2) != null) {
SchemaUrl = String.valueOf(node.jjtGetChild(2).value(context));
}
writer.write(autoLinkCitation(CitationMe, KeyWords, SchemaUrl));
return true;
}
public String autoLinkCitation(String CitationMe, String KeyWords, String SchemaUrl) {
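// Writer.write(null) throws a NullPointerException, so never return null from here.
if (CitationMe == null || KeyWords == null) {
return CitationMe == null ? "" : CitationMe;
}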
List<String> tokens = new ArrayList<String>();
StringTokenizer stkKeyWords = new StringTokenizer(KeyWords, "#");
while ( stkKeyWords.hasMoreTokens() ) {
tokens.add(stkKeyWords.nextToken());
}
String patternString = "\\b(" + StringUtils.join(tokens, "|") + ")\\b";
Pattern pattern = Pattern.compile(patternString);
String strippedHtml = CitationMe.replaceAll("<(.|\n)*?>", "");
StringTokenizer st = new StringTokenizer(strippedHtml, ".,! ()[]");
while (st.hasMoreTokens())
{
String token = st.nextToken().trim();
if (token.length() > 3)
{
Matcher matcher = pattern.matcher(cleanString(token));
while (matcher.find()) {
if(CitationMe.indexOf( SchemaUrl + cleanString(token) + "'") == -1)
{
String tmpStringreplacement = "<a href='" + SchemaUrl + cleanString(token) + "'>"+token+"</a>";
CitationMe = CitationMe.replaceAll("\\b"+token+"\\b(?!/)",tmpStringreplacement);
}
}
}
}
return CitationMe;
}
public String cleanString(String CleanStringMe) {
if (CleanStringMe == null) {
return null;
}
CleanStringMe = Normalizer.normalize(CleanStringMe, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
CleanStringMe = CleanStringMe.toLowerCase();
CleanStringMe = CleanStringMe.replaceAll("[^A-Za-z0-9]", "-");
return CleanStringMe;
}
}
And to display it in the Velocity template:
#autolinkcitation($doc.getFieldValue('body'),$doc.getFieldValue('ss_field_citation_keywords'), '/citations/mot.php?mot=' )


Xunit CSV streamReader.ReadToEnd returns System.ArgumentOutOfRangeException

I would like to evaluate a CSV data series with Xunit.
For this I need to read in a string consisting of int, bool, double and others.
With the following code, the transfer basically works for one row.
But since I want to test for predecessor values, I need a whole CSV file for evaluation.
My [Theory] works with InlineData without errors.
But when I read in a CSV file, the CSVDataHandler gives a System.ArgumentOutOfRangeException!
I can't find a solution for the error and ask for support.
Thanks a lot!
[Theory, CSVDataHandler(false, "C:\\MyTestData.txt", Skip = "")]
public void TestData(int[] newLine, int[] GetInt, bool[] GetBool)
{
for (int i = 0; i < newLine.Length; i++)
{
output.WriteLine("newLine {0}", newLine[i]);
output.WriteLine("GetInt {0}", GetInt[i]);
output.WriteLine("GetBool {0}", GetBool[i]);
}
}
[DataDiscoverer("Xunit.Sdk.DataDiscoverer", "xunit.core")]
[AttributeUsage(AttributeTargets.Method, AllowMultiple = true, Inherited = true)]
public abstract class DataAttribute : Attribute
{
public abstract IEnumerable<object[]> GetData(MethodInfo methodInfo);
public virtual string? Skip { get; set; }
}
[AttributeUsage(AttributeTargets.Method, AllowMultiple = false, Inherited = false)]
public class CSVDataHandler : DataAttribute
{
public CSVDataHandler(bool hasHeaders, string pathCSV)
{
this.hasHeaders = hasHeaders;
this.pathCSV = pathCSV;
}
public override IEnumerable<object[]> GetData(MethodInfo methodInfo)
{
var methodParameters = methodInfo.GetParameters();
var paramterTypes = methodParameters.Select(p => p.ParameterType).ToArray();
using (var streamReader = new StreamReader(pathCSV))
{
if (hasHeaders) { streamReader.ReadLine(); }
string csvLine = string.Empty;
// ReadLine ++
//while ((csvLine = streamReader.ReadLine()) != null)
//{
// var csvRow = csvLine.Split(',');
// yield return ConvertCsv((object[])csvRow, paramterTypes);
//}
// ReadToEnd ??
while ((csvLine = streamReader.ReadToEnd()) != null)
{
if (Environment.NewLine != null)
{
var csvRow = csvLine.Split(',');
yield return ConvertCsv((object[])csvRow, paramterTypes); // System.ArgumentOutOfRangeException
}
}
}
}
private static object[] ConvertCsv(IReadOnlyList<object> cswRow, IReadOnlyList<Type> parameterTypes)
{
var convertedObject = new object[parameterTypes.Count];
for (int i = 0; i < parameterTypes.Count; i++)
{
convertedObject[i] = (parameterTypes[i] == typeof(int)) ? Convert.ToInt32(cswRow[i]) : cswRow[i]; // System.ArgumentOutOfRangeException
convertedObject[i] = (parameterTypes[i] == typeof(double)) ? Convert.ToDouble(cswRow[i]) : cswRow[i];
convertedObject[i] = (parameterTypes[i] == typeof(bool)) ? Convert.ToBoolean(cswRow[i]) : cswRow[i];
}
return convertedObject;
}
}
MyTestData.txt
1,2,true,
2,3,false,
3,10,true,
The first call to streamReader.ReadToEnd() returns the entire contents of the file in one string, not just one line. When you call csvLine.Split(',') on it, you get a single array holding the fields of all rows (10 elements for the sample file, some of them with embedded line breaks) rather than one row.
The second call to streamReader.ReadToEnd() does not return null, as your while statement appears to expect, but an empty string. See the documentation at
https://learn.microsoft.com/en-us/dotnet/api/system.io.streamreader.readtoend?view=net-7.0
If the current position is at the end of the stream, returns an empty
string ("").
With the empty string, csvLine.Split(',') returns an array containing a single empty string, so ConvertCsv runs past the end of the row as soon as it reads an element beyond index 0, which causes your exception.
All of this could have been easily discovered by simply starting the test in a debugger.
It looks like you have some other issues here as well.
I don't understand what your if (Environment.NewLine != null) is intended to do, the NewLine property will never be null but should have one of the values "\r\n" or "\n" so the if will always be taken.
The parameters of your test method are the arrays int[] and bool[], but ConvertCsv checks against the types int, double and bool, so the fallback cswRow[i] is always returned. You'll wind up passing strings to a method that expects int[] and bool[], and will get an error there at the latest.
This method reads a data series from several rows and columns and returns it as an array for testing purposes.
The conversion of the columns can be adjusted according to existing pattern.
Thanks to Christopher!
[AttributeUsage(AttributeTargets.Method, AllowMultiple = false, Inherited = false)]
public class CSVDataHandler : Xunit.Sdk.DataAttribute
{
private readonly string pathCSV;
public CSVDataHandler(string pathCSV)
{
this.pathCSV = pathCSV;
}
public override IEnumerable<object[]> GetData(MethodInfo methodInfo)
{
List<int> newLine = new();
List<int> GetInt = new();
List<bool> GetBool = new();
// Dispose the reader once all rows have been read.
using var reader = new StreamReader(pathCSV);
string readData = string.Empty;
while ((readData = reader.ReadLine()) != null)
{
string[] split = readData.Split(new char[] { ',' });
newLine.Add(int.Parse(split[0]));
GetInt.Add(int.Parse(split[1]));
GetBool.Add(bool.Parse(split[2]));
// Add more objects ...
}
yield return new object[] { newLine.ToArray(), GetInt.ToArray(), GetBool.ToArray() };
}
}

How do I add a record to a data table

I want to add a record to my table through a form, but ListerPF() raises an error:
"Object reference not set to an instance of an object"
Here is the whole process:
await App.methode.AddNewPF(Label_NumeroNF.Text, DatePicker_Date.Date, Editor_LibelleNF.Text, Picker_TypeFrais.SelectedItem.ToString(), double.Parse(Entry_Quantite.Text), double.Parse(Entry_Tarif.Text), double.Parse(Entry_Montant.Text), double.Parse(Entry_MontantTotal.Text), CheckBox_CCEntreprise.IsChecked, int.Parse(Entry_Imputation.Text));
Navigation.PushAsync(new PostesNF());
And the code behind "AddNewPF" is:
public async Task AddNewPF(string numero , DateTime date, string libelle, string typeFrais, double quantite, double tarif, double montant, double montantTotal, bool carteCredit, int imputationCC)
{
int result = 0;
try
{
result = await connection.InsertAsync(new DB_PosteNF { Numero = numero, Date = date, Libelle = libelle, TypeFrais = typeFrais, Quantite = quantite, Tarif = tarif, Montant = montant, MontantTotal = montantTotal, CarteCredit = carteCredit, ImputationCC = imputationCC});
StatutMessage = $"{result} poste de frais ajouté : {numero} | {typeFrais} ";
}
catch (Exception ex)
{
StatutMessage = $"Impossible d'ajouter le poste de frais avec le numéro: {numero}. \nErreur : {ex.Message}";
}
}
When the "PosteNF" page appears, I have this code
protected override async void OnAppearing()
{
base.OnAppearing();
CollectionViewPF.ItemsSource = await App.methode.ListerPF();
}
the problematic code "ListerPF" is:
public async Task<List<DB_PosteNF>> ListerPF(DB_ListeNF datareceived)
{
try
{
string numero = datareceived.Numero;
//return await connection.Table<DB_PosteNF>().ToListAsync();
return await connection.Table<DB_PosteNF>().Where(x => x.Numero == numero).ToListAsync();
}
catch (Exception ex)
{
StatutMessage = $"Impossible d'afficher la liste des postes de frais. \nErreur : {ex.Message}";
}
return new List<DB_PosteNF>();
}
But when I go back to the "PostesNF" page, the registered data is not displayed. Thanks for your help !
Per your code snippets, the model is DB_PosteNF with 10 properties. The key point is that when you retrieve the records from the data table, I think you need an ObservableCollection<DB_PosteNF> to hold the entities read from the database. Below is a code snippet for your reference:
First, define a UserCollection using ObservableCollection
public ObservableCollection<User> UserCollection { get; set; }
Then, in OnAppearing method:
protected override async void OnAppearing()
{
base.OnAppearing();
UserDB db = await UserDB.Instance;
List<User> a = await db.GetUserAsync();
UserCollection = new ObservableCollection<User>(a);
userinfodata.ItemsSource = UserCollection;
}
PS: In my case, the model is User.

How to send a zipped file to S3 bucket from Apex?

Folks,
I am trying to move data to s3 from Salesforce using apex class. I have been told by the data manager to send the data in zip/gzip format to the S3 bucket for storage cost savings.
I have simply tried to do a request.setCompressed(true); as I've read it compresses the body before sending it to the endpoint. Code below:
HttpRequest request = new HttpRequest();
request.setEndpoint('callout:'+DATA_NAMED_CRED+'/'+URL+'/'+generateUniqueTimeStampforSuffix());
request.setMethod('PUT');
request.setBody(JSON.serialize(data));
request.setCompressed(true);
request.setHeader('Content-Type','application/json');
But no matter what I always receive this:
<Error><Code>XAmzContentSHA256Mismatch</Code><Message>The provided 'x-amz-content-sha256' header does not match what was computed.</Message><ClientComputedContentSHA256>fd31b2b9115ef77e8076b896cb336d21d8f66947210ffcc9c4d1971b2be3bbbc</ClientComputedContentSHA256><S3ComputedContentSHA256>1e7f2115e60132afed9e61132aa41c3224c6e305ad9f820e6893364d7257ab8d</S3ComputedContentSHA256>
I have tried multiple headers too, like setting the content type to gzip/zip, etc.
Any pointers in the right direction would be appreciated.
I had a good amount of headaches attempting to do a similar thing. I feel your pain.
The following code has worked for us using lambda functions; you can try modifying it and see what happens.
public class AwsApiGateway {
// Things we need to know about the service. Set these values in init()
String host, payloadSha256;
String resource;
String service = 'execute-api';
String region;
public Url endpoint;
String accessKey;
String stage;
string secretKey;
HttpMethod method = HttpMethod.XGET;
// Remember to set "payload" here if you need to specify a body
// payload = Blob.valueOf('some-text-i-want-to-send');
// This method helps prevent leaking secret key,
// as it is never serialized
// Url endpoint;
// HttpMethod method;
Blob payload;
// Not used externally, so we hide these values
Blob signingKey;
DateTime requestTime;
Map<String, String> queryParams = new map<string,string>(), headerParams = new map<string,string>();
void init(){
if (payload == null) payload = Blob.valueOf('');
requestTime = DateTime.now();
createSigningKey(secretKey);
}
public AwsApiGateway(String resource){
this.stage = AWS_LAMBDA_STAGE;
this.resource = '/' + stage + '/' + resource;
this.region = AWS_REGION;
this.endpoint = new Url(AWS_ENDPOINT);
this.accessKey = AWS_ACCESS_KEY;
this.secretKey = AWS_SECRET_KEY;
}
// Make sure we can't misspell methods
public enum HttpMethod { XGET, XPUT, XHEAD, XOPTIONS, XDELETE, XPOST }
public void setMethod (HttpMethod method){
this.method = method;
}
public void setPayload (string payload){
this.payload = Blob.valueOf(payload);
}
// Add a header
public void setHeader(String key, String value) {
headerParams.put(key.toLowerCase(), value);
}
// Add a query param
public void setQueryParam(String key, String value) {
queryParams.put(key.toLowerCase(), uriEncode(value));
}
// Create a canonical query string (used during signing)
String createCanonicalQueryString() {
String[] results = new String[0], keys = new List<String>(queryParams.keySet());
keys.sort();
for(String key: keys) {
results.add(key+'='+queryParams.get(key));
}
return String.join(results, '&');
}
// Create the canonical headers (used for signing)
String createCanonicalHeaders(String[] keys) {
keys.addAll(headerParams.keySet());
keys.sort();
String[] results = new String[0];
for(String key: keys) {
results.add(key+':'+headerParams.get(key));
}
return String.join(results, '\n')+'\n';
}
// Create the entire canonical request
String createCanonicalRequest(String[] headerKeys) {
return String.join(
new String[] {
method.name().removeStart('X'), // METHOD
new Url(endPoint, resource).getPath(), // RESOURCE
createCanonicalQueryString(), // CANONICAL QUERY STRING
createCanonicalHeaders(headerKeys), // CANONICAL HEADERS
String.join(headerKeys, ';'), // SIGNED HEADERS
payloadSha256 // SHA256 PAYLOAD
},
'\n'
);
}
// We have to replace ~ and " " correctly, or we'll break AWS on those two characters
string uriEncode(String value) {
return value==null? null: EncodingUtil.urlEncode(value, 'utf-8').replaceAll('%7E','~').replaceAll('\\+','%20');
}
// Create the entire string to sign
String createStringToSign(String[] signedHeaders) {
String result = createCanonicalRequest(signedHeaders);
return String.join(
new String[] {
'AWS4-HMAC-SHA256',
headerParams.get('date'),
String.join(new String[] { requestTime.formatGMT('yyyyMMdd'), region, service, 'aws4_request' },'/'),
EncodingUtil.convertToHex(Crypto.generateDigest('sha256', Blob.valueof(result)))
},
'\n'
);
}
// Create our signing key
void createSigningKey(String secretKey) {
signingKey = Crypto.generateMac('hmacSHA256', Blob.valueOf('aws4_request'),
Crypto.generateMac('hmacSHA256', Blob.valueOf(service),
Crypto.generateMac('hmacSHA256', Blob.valueOf(region),
Crypto.generateMac('hmacSHA256', Blob.valueOf(requestTime.formatGMT('yyyyMMdd')), Blob.valueOf('AWS4'+secretKey))
)
)
);
}
// Create all of the bits and pieces using all utility functions above
public HttpRequest createRequest() {
init();
payloadSha256 = EncodingUtil.convertToHex(Crypto.generateDigest('sha-256', payload));
setHeader('date', requestTime.formatGMT('yyyyMMdd\'T\'HHmmss\'Z\''));
if(host == null) {
host = endpoint.getHost();
}
setHeader('host', host);
HttpRequest request = new HttpRequest();
request.setMethod(method.name().removeStart('X'));
if(payload.size() > 0) {
setHeader('Content-Length', String.valueOf(payload.size()));
request.setBodyAsBlob(payload);
}
String finalEndpoint = new Url(endpoint, resource).toExternalForm(),
queryString = createCanonicalQueryString();
if(queryString != '') {
finalEndpoint += '?'+queryString;
}
request.setEndpoint(finalEndpoint);
for(String key: headerParams.keySet()) {
request.setHeader(key, headerParams.get(key));
}
String[] headerKeys = new String[0];
String stringToSign = createStringToSign(headerKeys);
request.setHeader(
'Authorization',
String.format(
'AWS4-HMAC-SHA256 Credential={0}, SignedHeaders={1},Signature={2}',
new String[] {
String.join(new String[] { accessKey, requestTime.formatGMT('yyyyMMdd'), region, service, 'aws4_request' },'/'),
String.join(headerKeys,';'), EncodingUtil.convertToHex(Crypto.generateMac('hmacSHA256', Blob.valueOf(stringToSign), signingKey))}
));
system.debug(json.serializePretty(request.getEndpoint()));
return request;
}
// Actually perform the request, and throw exception if response code is not valid
public HttpResponse sendRequest(Set<Integer> validCodes) {
HttpResponse response = new Http().send(createRequest());
if(!validCodes.contains(response.getStatusCode())) {
system.debug(json.deserializeUntyped(response.getBody()));
}
return response;
}
// Same as above, but assume that only 200 is valid
// This method exists because most of the time, 200 is what we expect
public HttpResponse sendRequest() {
return sendRequest(new Set<Integer> { 200 });
}
// TEST METHODS
public static string getEndpoint(string attribute){
AwsApiGateway api = new AwsApiGateway(attribute);
return api.createRequest().getEndpoint();
}
public static string getEndpoint(string attribute, map<string, string> params){
AwsApiGateway api = new AwsApiGateway(attribute);
for (string key: params.keySet()){
api.setQueryParam(key, params.get(key));
}
return api.createRequest().getEndpoint();
}
public class EndpointConfig {
string resource;
string attribute;
list<object> items;
map<string,string> params;
public EndpointConfig(string resource, string attribute, list<object> items){
this.items = items;
this.resource = resource;
this.attribute = attribute;
}
public EndpointConfig setQueryParams(map<string,string> parameters){
params = parameters;
return this;
}
public string endpoint(){
if (params == null){
return getEndpoint(resource);
} else return getEndpoint(resource + '/' + attribute, params);
}
public SingleRequestMock mockResponse(){
return new SingleRequestMock(200, 'OK', json.serialize(items), null);
}
}
}
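Note that in the class above, payloadSha256 is computed from the exact payload blob that is sent. The XAmzContentSHA256Mismatch error in the question reports that the hash supplied by the client differs from the one S3 computed over the bytes it received. One possible explanation, which is an assumption and not something confirmed in this thread, is that the hash was taken over the uncompressed body while compression changed the bytes on the wire. The following Java sketch (class and method names are hypothetical, and it is an illustration rather than Apex code) shows that the two hashes necessarily differ, so the digest that gets signed has to be computed over the compressed bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.GZIPOutputStream;

public class GzipPayloadHash {
    // Compress the raw payload into gzip format entirely in memory.
    public static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Lowercase hex SHA-256, the format used for the x-amz-content-sha256 header.
    public static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }
}
```

With these helpers, sha256Hex(gzip(body)) is the value that matches what a server would compute for a gzip-compressed upload, whereas sha256Hex(body) matches only an uncompressed one.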

Change css chips on validation error

I am working with PrimeFaces and using the "chips" component to enter multiple e-mail addresses. I validate their format, and I'm trying to change the style of the "bubbles" to ui-state-error when validation fails.
My chips:
<div class="ui-grid-col-1" style="margin-right: 5px;">
<p:chips id="chips" required="true" value="#{Contactos.lista_email}"
placeholder="Email" style="color: red;"
requiredMessage="ERROR: El campo 'Email' es obligatorio"
validator="ValidMail"/>
</div>
My validator:
@FacesValidator(value = "ValidMail")
public class validatorMail implements Validator{
private Pattern pattern;
private Matcher matcher;
private static final String EMAIL_PATTERN =
"^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*@"
+ "[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$";
@Override
public void validate(FacesContext context, UIComponent component,
Object value) throws ValidatorException {
ArrayList<String> aux = (ArrayList<String>) value;
String error = null;
final String EMAIL_PATTERN =
"^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*@"
+ "[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$";
//Check if user has typed only blank spaces
if(value.toString().trim().isEmpty()){
FacesMessage msg = new FacesMessage("ERROR: Email requiere un formato válido", "Email: requiere un formato válido");
msg.setSeverity(FacesMessage.SEVERITY_ERROR);
throw new ValidatorException(msg);
}
else {
pattern = Pattern.compile(EMAIL_PATTERN);
for (String temp : aux){
boolean valid = this.validate(temp);
if (valid == false){
if (error == null) error = temp;
else error = error + "\n" + temp;
}
}
if (error != null){
chips.setValid(false);
FacesMessage msg = new FacesMessage("ERROR DE FORMATO: "+"\n" + error, " ");
msg.setSeverity(FacesMessage.SEVERITY_ERROR);
chips.setValid(false);
throw new ValidatorException(msg);
}
}
}
Thanks all
There is currently no way to mark individual chip elements as invalid. The PrimeFaces component org.primefaces.component.chips.Chips doesn't provide a means for that in version 6.1. You could of course open a feature request on GitHub.
The only way for now is to mark the whole component as invalid and provide an appropriate text in your FacesMessage:
((UIInput) component).setValid(false);
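Since the message text is the only place to point at the offending addresses, it helps to collect them first. Here is a minimal sketch in plain Java, without any JSF dependency, of the per-address check that the question's validator calls but does not show. The class and method names are hypothetical, and the pattern is the one from the question, with `@` separating the local part from the domain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class EmailListChecker {
    // Compiled once; the same pattern as in the question's validator.
    private static final Pattern EMAIL = Pattern.compile(
        "^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*@"
        + "[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$");

    // True when the single address matches the pattern.
    public static boolean isValid(String email) {
        return email != null && EMAIL.matcher(email.trim()).matches();
    }

    // Return every entry that fails validation, in input order.
    public static List<String> invalidEntries(List<String> emails) {
        List<String> invalid = new ArrayList<>();
        for (String email : emails) {
            if (!isValid(email)) {
                invalid.add(email);
            }
        }
        return invalid;
    }
}
```

Inside validate(), a non-empty result from invalidEntries(...) could be joined into the FacesMessage before calling setValid(false) on the component and throwing the ValidatorException.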

Wikipedia page parsing program caught in endless graph cycle

My program is caught in a cycle that never ends, and I can't see how it gets into this trap or how to avoid it.
It's parsing Wikipedia data, and I think it's just following a connected component around and around.
Maybe I can store the pages I've already visited in a set, and if a page is in that set I won't go back to it?
This is my project; it's quite small, only three short classes.
This is a link to the data it generates, I stopped it short, otherwise it would have gone on and on.
This is the laughably small toy input that generated that mess.
It's the same project I was working on when I asked this question.
What follows is the entirety of the code.
The main class:
public static void main(String[] args) throws Exception
{
String name_list_file = "/home/matthias/Workbench/SUTD/nytimes_corpus/NYTimesCorpus/2005/01/02/test/people_test.txt";
String single_name;
try (
// read in the original file, list of names, w/e
InputStream stream_for_name_list_file = new FileInputStream( name_list_file );
InputStreamReader stream_reader = new InputStreamReader( stream_for_name_list_file , Charset.forName("UTF-8"));
BufferedReader line_reader = new BufferedReader( stream_reader );
)
{
while (( single_name = line_reader.readLine() ) != null)
{
//replace this by a URL encoder
//String associated_alias = single_name.replace(' ', '+');
String associated_alias = URLEncoder.encode( single_name , "UTF-8");
String platonic_key = single_name;
System.out.println("now processing: " + platonic_key);
Wikidata_Q_Reader.getQ( platonic_key, associated_alias );
}
}
//print the struc
Wikidata_Q_Reader.print_data();
}
The Wikipedia reader / value grabber:
static Map<String, HashSet<String> > q_valMap = new HashMap<String, HashSet<String> >();
//public static String[] getQ(String variable_entity) throws Exception
public static void getQ( String platonic_key, String associated_alias ) throws Exception
{
//get the corresponding wikidata page
//check the validity of the URL
String URL_czech = "https://www.wikidata.org/wiki/Special:ItemByTitle?site=en&page=" + associated_alias + "&submit=Search";
URL wikidata_page = new URL(URL_czech);
HttpURLConnection wiki_connection = (HttpURLConnection)wikidata_page.openConnection();
InputStream wikiInputStream = null;
try
{
// try to connect and use the input stream
wiki_connection.connect();
wikiInputStream = wiki_connection.getInputStream();
}
catch(IOException e)
{
// failed, try using the error stream
wikiInputStream = wiki_connection.getErrorStream();
}
BufferedReader wiki_data_pagecontent = new BufferedReader(
new InputStreamReader(
wikiInputStream ));
String line_by_line;
while ((line_by_line = wiki_data_pagecontent.readLine()) != null)
{
// if we can determine it's a disambig page we need to send it off to get all
// the possible senses in which it can be used.
Pattern disambig_pattern = Pattern.compile("<div class=\"wikibase-entitytermsview-heading-description \">Wikipedia disambiguation page</div>");
Matcher disambig_indicator = disambig_pattern.matcher(line_by_line);
if (disambig_indicator.matches())
{
//off to get the different usages
Wikipedia_Disambig_Fetcher.all_possibilities( platonic_key, associated_alias );
}
else
{
//get the Q value off the page by matching
Pattern q_page_pattern = Pattern.compile("<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item " +
"wikibase-toolbar \">\\[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a " +
"href=\"/wiki/Special:SetSiteLink/(.*?)\">edit</a></span>\\]</span></span>");
Matcher match_Q_component = q_page_pattern.matcher(line_by_line);
if ( match_Q_component.matches() )
{
String Q = match_Q_component.group(1);
// 'Q' should be appended to an array, since each entity can hold multiple
// Q values on that basis of disambig
put_to_hash( platonic_key, Q );
}
}
}
wiki_data_pagecontent.close();
// \\ // ! PRINT IT ! // \\ // \\ // \\ // \\ // \\ // \\
for (Map.Entry<String, HashSet<String> > entry : q_valMap.entrySet())
{
System.out.println(entry.getKey()+" : " + Arrays.deepToString(q_valMap.entrySet().toArray()) );
}
}
// add Q values to their arrayList in the hash map at the index of the appropriate entity
public static HashSet<String> put_to_hash(String key, String value )
{
HashSet<String> valSet;
if (q_valMap.containsKey(key)) {
valSet = q_valMap.get(key);
} else {
valSet = new HashSet<String>();
q_valMap.put(key, valSet);
}
valSet.add(value);
return valSet;
}
// print the collected Q values for every entity
public static void print_data()
{
System.out.println("THIS IS THE FINAL DATA SET!!!");
// \\ // ! PRINT IT ! // \\ // \\ // \\ // \\ // \\ // \\
for (Map.Entry<String, HashSet<String> > entry : q_valMap.entrySet())
{
System.out.println(entry.getKey()+" : " + Arrays.deepToString(q_valMap.entrySet().toArray()) );
}
}
Dealing with disambiguation pages:
public static void all_possibilities( String platonic_key, String associated_alias ) throws Exception
{
System.out.println("this is a disambig page");
//if it's a disambig page we know we can go right to the Wikipedia
//get it's normal wiki disambig page
String URL_czech = "https://en.wikipedia.org/wiki/" + associated_alias;
URL wikidata_page = new URL(URL_czech);
HttpURLConnection wiki_connection = (HttpURLConnection)wikidata_page.openConnection();
InputStream wikiInputStream = null;
try
{
// try to connect and use the input stream
wiki_connection.connect();
wikiInputStream = wiki_connection.getInputStream();
}
catch(IOException e)
{
// failed, try using the error stream
wikiInputStream = wiki_connection.getErrorStream();
}
// parse the input stream using Jsoup
Document docx = Jsoup.parse(wikiInputStream, null, wikidata_page.getProtocol()+"://"+wikidata_page.getHost()+"/");
//this can handle the less structured ones.
Elements linx = docx.select( "p:contains(" + associated_alias + ") ~ ul a:eq(0)" );
for (Element linq : linx)
{
System.out.println(linq.text());
String linq_nospace = URLEncoder.encode( linq.text() , "UTF-8");
Wikidata_Q_Reader.getQ( platonic_key, linq_nospace );
}
}
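The visited-set idea raised in the question could be sketched as follows. This is only an illustration under stated assumptions: the class VisitedCrawl is hypothetical, and the links map is an in-memory stand-in for fetching a page and extracting its links, which the real program does over HTTP with Jsoup.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class VisitedCrawl {
    // Visit every page reachable from 'start' exactly once, even if pages link in a cycle.
    public static Set<String> crawl(String start, Map<String, List<String>> links) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(start);
        while (!pending.isEmpty()) {
            String page = pending.pop();
            // add() returns false when the page was already seen, which breaks the cycle.
            if (!visited.add(page)) {
                continue;
            }
            for (String next : links.getOrDefault(page, Collections.emptyList())) {
                if (!visited.contains(next)) {
                    pending.push(next);
                }
            }
        }
        return visited;
    }
}
```

In the program above, the same guard would amount to keeping a static Set<String> of aliases already passed to getQ and returning immediately when an alias is seen a second time.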
