I have a data set of requests obtained from numerous PCAP files and have loaded these PCAP files into R. Each PCAP file effectively corresponds to a single observation (row).
In this data set there is a "Request" column containing a string with the request made by the source. For example, a request may read:
http://111.22.33.1/ilove/usingR/extraextra/sqli/?id='or1=1--
I want to tokenize each request string in order to run some machine learning algorithms on it. What would be the best way to tokenize strings like these for analysis? I know packages such as tm exist, but I have had little experience with them.
I'm afraid you first have to examine your request variable and look for recurring patterns that suggest rules for tokenizing it.
Then you could use str_split with "/" as the pattern. If you keep each token's position in the string, some models may be able to find co-occurrence patterns in your requests.
Then do some analysis, such as frequency counts, on both the IP addresses and the remaining text tokens.
tm is aimed more at text corpora. Here, since these are automatically generated strings, you can probably extract useful information with more classical methods first.
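As a minimal sketch (assuming the stringr package and that your requests sit in a character vector; the cleanup steps are just one reasonable choice):

    library(stringr)

    req <- "http://111.22.33.1/ilove/usingR/extraextra/sqli/?id='or1=1--"

    # drop the protocol, then split on "/" into host and path segments
    tokens <- str_split(str_remove(req, "^https?://"), "/")[[1]]
    tokens <- tokens[tokens != ""]   # drop any empty segments
    tokens
    # "111.22.33.1" "ilove" "usingR" "extraextra" "sqli" "?id='or1=1--"

    # frequency counts over many requests (requests = character vector)
    # token_counts <- sort(table(unlist(lapply(requests, function(r)
    #   str_split(str_remove(r, "^https?://"), "/")[[1]]))), decreasing = TRUE)

From there you can build a document-term-style matrix of token counts per request and feed that to your learning algorithm.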
It's a Drupal site with Solr for search. Mainly I am not satisfied with the current search results for Chinese. The tokenizer has broken the words into small pieces. Most of them are reasonable, but it still makes mistakes: it fails to treat something as a valid token, either breaking it into pieces or not breaking it at all.
Assume I am writing Chinese now: "big data analysis" is one word which shouldn't be broken, so a search on it should find it. I also want people to find "AI and big data analysis training" as the first hit when they search for that exact phrase.
So I want a way to intervene in, or compensate for, the current tokens to make the search smarter.
Maybe there is a file in Solr that lets me manually write these tokens down and relate them to certain phrases, so that every time it indexes, Solr can use it as a reference?
There are different steps to achieve what you want:
1) I don't see an extremely big problem with your "over-tokenization":
"big data analysis is one word which shouldn't be broken. So my search on it should find it." -> your search will find it even if it is tokenized. I understand this was an example and the actual words are Chinese, but I suspect a different issue there.
2) You can use the edismax [1] query parser with phrase boosting at various levels to boost subsequent tokens or phrases (pf, pf2, pf3 ... ps, ps2, ps3 ...), as sketched below.
[1] https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html , https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html#TheExtendedDisMaxQueryParser-ThepsParameter
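For illustration, a request using those parameters could look roughly like the sketch below (title and body are placeholder field names for your own schema):

    q=AI and big data analysis training
    defType=edismax
    qf=title body
    pf=title^20 body^10
    pf2=title^10 body^5
    pf3=title^5 body^3
    ps=1

Here pf boosts documents matching the whole phrase, pf2 and pf3 boost matching word pairs and triples, and ps controls the slop allowed for the pf phrase match; the boost values are arbitrary and need tuning.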
How do I handle misspelled words in the Watson Conversation API? The NLP technique/algorithm used in the Conversation API calculates word rankings and matches against the trained data based on the rank. But how do I handle misspelled words or short names in English?
At the moment there is nothing special to handle misspellings. The best approach is to use the 'Synonyms' option within entities to add what you expect the user to type, including misspellings, short names, and acronyms.
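As a rough illustration (the exact field names below are recalled from a workspace export and should be checked against your own), an entity value with misspellings added as synonyms might look something like:

    {
      "entity": "department",
      "values": [
        {
          "value": "Human Resources",
          "synonyms": ["HR", "human resource", "humen resources", "human resourses"]
        }
      ]
    }

The same can of course be entered through the tooling UI instead of editing the JSON.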
We want to dynamically read multiple CSV files, generated in one go, through Oracle PL/SQL or Oracle Proc (for one of our requirements), and we are looking for some pseudo-code snippets or logic to build this.
We searched for this but had no luck. The requirement has to be met purely through Oracle; no Java is involved here.
I dealt with this problem in the past, and what I did was write a (quite simple) parsing function similar to split. It accepts two parameters, a string and a separator, and returns an array of strings.
You then load the whole file into a text variable (declared big enough to hold the whole file) and invoke the split function (with EOL as the separator) to split the buffer into lines.
Then, for each line, invoke the parser again using a comma as the separator.
Though the parser is simple, you need to take possible edge cases into account (e.g. skipping blanks that are not part of a string, handling single/double quotes, etc.).
Unfortunately, I left the company at which the parser was developed, otherwise I would have posted the source here.
Hope this helps you.
UPDATE: Added some PSEUDO-CODE
For the Parser:
This mechanism is based on a state-machine concept.
Define a variable that will reflect the state of the parsing; possible values being: BEFORE_VALUE, AFTER_VALUE, IN_STRING, IN_SEPARATOR, IN_BLANK; initially, you will be in state BEFORE_VALUE;
Examine each character of the received string and decide, based on the character and the current state, which transition to make;
It is up to you to decide what to do with blanks, like the one before ccc in aaa,bbb, ccc,ddd (in my case, I ignored them);
Whenever you start or continue a value, you append the character to a temporary variable;
Once you have finished a value, you add the collected sub-string (stored in the temporary variable) to the array of strings;
The state-machine mechanism is needed to properly handle situations such as a comma being part of a quoted string value (which is why you cannot simply search for commas and chop the whole string on them);
Another point to take into account is empty values, which are represented by two consecutive commas (i.e. in your state machine, if you find a comma while your state is IN_SEPARATOR, it means you just passed an empty value);
Note that exactly the same mechanism can be used for splitting the initial buffer into lines and then each line into fields (the only difference is the input string, the separator, and the delimiter).
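To make the transitions concrete, here is a minimal sketch of the idea, written in R only because I can no longer post the original PL/SQL; it uses a simplified state set and leaves blank skipping and single quotes out, but the real PL/SQL function would follow the same pattern:

    # illustrative sketch only -- port the same transitions to a PL/SQL function
    split_csv_line <- function(line, sep = ",") {
      state  <- "BEFORE_VALUE"
      fields <- character(0)
      buf    <- ""
      for (ch in strsplit(line, "")[[1]]) {
        if (state == "IN_STRING") {            # inside a quoted value
          if (ch == '"') state <- "AFTER_VALUE" else buf <- paste0(buf, ch)
        } else if (ch == sep) {                # value finished (possibly empty)
          fields <- c(fields, buf)
          buf    <- ""
          state  <- "BEFORE_VALUE"
        } else if (ch == '"' && state == "BEFORE_VALUE") {
          state <- "IN_STRING"                 # opening quote
        } else {
          buf   <- paste0(buf, ch)             # plain character of a value
          state <- "IN_VALUE"
        }
      }
      c(fields, buf)                           # flush the last value
    }

    split_csv_line('aaa,"bb,b",,ddd')
    # -> "aaa"  "bb,b"  ""  "ddd"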
For the File handling process:
Load the file into a local buffer (big enough, preferably a CLOB),
Split the file into records (using the above function) and then loop through the received records,
For each record, invoke the parser with the correct parameters (i.e. the record string, the delimiter, and ',' as separator),
The parser will return you the fields contained in the record with which you can proceed and do whatever you need to do.
Well, I hope this helps you to implement the needed code (it looks complex, but it is not; just code it slowly, taking into account the possible conditions you may encounter when running the state machine).
I am looking for advice. The following website
http://brfares.com/#home
provides fares information for UK train lines. I would like to use it to build a database of travel costs for season tickets from different locations. I have never done this kind of thing before but have experience with Python/Bash scripting and some HTML.
Viewing the source code for a typical query, the actual fare information is not displayed in index.html. Can anyone provide a pointer on how to go about scraping (a new word for me) the information?
This is the URL for the query: http://brfares.com/querysimple?orig=SUY&dest=0415&rlc=
The response is a JSON object.
First you need to build a lookup table of all destination codes. You can use the following link to do that: http://brfares.com/ac_loc?term=. Do it for all the letters of the alphabet and then parse the results into a unique list.
Then take the codes pair by pair, execute the JSON query, parse the returned JSON, and feed the data into a database.
Now you can do whatever you want with that database.
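A minimal sketch in R with httr and jsonlite (the shape of the returned JSON is not documented, so inspect a response first and adapt the parsing; the origin/destination codes are just the ones from the example URL):

    library(httr)
    library(jsonlite)

    # 1) lookup table of location codes: hit the autocomplete endpoint
    #    once per letter, then combine and de-duplicate what comes back
    lookup <- lapply(letters, function(ch) {
      res <- GET("http://brfares.com/ac_loc", query = list(term = ch))
      fromJSON(content(res, as = "text", encoding = "UTF-8"))
    })

    # 2) fares for one origin/destination pair
    res   <- GET("http://brfares.com/querysimple",
                 query = list(orig = "SUY", dest = "0415", rlc = ""))
    fares <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
    str(fares)   # inspect the structure, then pick out the fields to store

The same logic is straightforward to write in Python with requests if you prefer to stay with what you already know.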
I want to extract information from a large website and generate an ontology, something that can be processed with description logic.
What data structure is advisable for the extracted html data?
My ideas so far:
- Use Data Frames, Table Structures
- Sets and Relations (sets package and good relations)
- Graphs
In the end I want to export the data and plan to process it with predicate logic (or description logic) in another programming language.
I want to use R to extract information from HTML pages. But as I understand it, there is no direct support in R (or its packages) for predicate logic or RDF/OWL.
So I need to do the extraction, use some data structure in the process and export the data.
Example Data:
SomeDocument rdf:type PDFDocument
PDFDocument rdfs:subClassOf Document
SomeDocument isUsedAt DepartmentA
DepartmentA rdf:type Department
PersonA rdf:type Person
PersonA headOf DepartmentA
PersonA hasName "John"
Where the instance data is "SomeDocument", "DepartmentA" and "PersonA".
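For instance, a plain subject/predicate/object data frame would already hold these statements and could be exported later (a sketch of that idea):

    # the example statements as a subject/predicate/object data frame
    triples <- data.frame(
      subject   = c("SomeDocument", "PDFDocument", "SomeDocument",
                    "DepartmentA", "PersonA", "PersonA", "PersonA"),
      predicate = c("rdf:type", "rdfs:subClassOf", "isUsedAt",
                    "rdf:type", "rdf:type", "headOf", "hasName"),
      object    = c("PDFDocument", "Document", "DepartmentA",
                    "Department", "Person", "DepartmentA", "\"John\""),
      stringsAsFactors = FALSE
    )

    subset(triples, predicate == "rdf:type")   # e.g. all type assertions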
If it makes sense, some sort of reasoning (but probably not in R):
AccessedOften(SomeDocument) => ImportantDocument(SomeDocument)
Most important is: what does your website data look like? For instance, if it already has RDFa in it, you would use an RDFa distiller to get the RDF out; simple, done. Then you could load the RDF into a triple store. You could augment the website's data by creating your own ontology, which you would query using SPARQL; if your ontology declares classes equivalent to the data you found on your website, then you are golden. Many triple stores can be queried as SPARQL endpoints via URLs alone and return results as XML, so even if R has no SPARQL or OWL ontology packages per se, it doesn't mean you can't query the data at all.
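To make that last point concrete, here is a rough R sketch of querying a SPARQL endpoint over plain HTTP (the endpoint URL and class IRI are placeholders, and the parsing assumes the standard SPARQL XML results format):

    library(httr)
    library(xml2)

    endpoint <- "http://localhost:3030/mydataset/sparql"   # placeholder endpoint

    query <- '
      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      SELECT ?doc WHERE { ?doc rdf:type <http://example.org/PDFDocument> }
    '

    res <- GET(endpoint, query = list(query = query),
               add_headers(Accept = "application/sparql-results+xml"))
    xml <- read_xml(content(res, as = "text", encoding = "UTF-8"))

    # pull the bound URIs out of the result bindings, ignoring namespaces
    docs <- xml_text(xml_find_all(
      xml, "//*[local-name()='binding'][@name='doc']/*[local-name()='uri']"))
    docs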
If it requires a lot of pages to be downloaded, I would use wget to download them. To process the files, I would use a Perl script to transform the data into a more readable format, e.g. comma-separated. Then I would turn to some programming language to combine things in the way you describe; however, I would not go for R for this.