Friday, 15 June 2012

java - Correct way to write a Tokenizer in Lucene


I am trying to analyze the content of a Drupal database for collective intelligence purposes.

So far I've been able to work out a simple example that tokenizes the various content (mainly forum posts) and counts the tokens after removing stop words.

The StandardTokenizer supplied with Lucene should be able to tokenize host names and emails, but the content can also contain embedded HTML, for example:

  Pubblichiamo la presentazione di IBM riguardante DB2 per i vari sistemi operativi Linux, UNIX e Windows.\r\n\r\nQuesto documento sta sulla piattaforma KM e lo potete scaricare a questo <a href=\'https://sfkm.griffon.local/sites/BSF%20KM/BSF/CC%20T/Specifiche/Eventi2008/IBM%20DB2%20for%20Linux%20UNIX%20e%20Windows.pdf\' target=blank>link</a>

This is tokenized badly, like this:

  pubblichiamo -> 1
  presentazione -> 1
  ibm -> 1
  riguardante -> 1
  db2 -> 1
  vari -> 1
  sistemi -> 1
  operativi -> 1
  linux -> 1
  unix -> 1
  windows -> 1
  documento -> 1
  piattaforma -> 1
  km -> 1
  potete -> 1
  scaricare -> 1
  href -> 1
  https -> 1
  sfkm.griffon.local -> 1
  sites -> 1
  bsf -> 1
  20km/bsf -> 1
  cc -> 1
  20t/specifiche/eventi2008/ibm -> 1
  20db2 -> 1
  20for -> 1
  20linux -> 1
  20unix -> 1
  20e -> 1
  20windows.pdf -> 1
  target -> 1
  blank -> 1
  link -> 1

I would like links to be kept together and HTML tags, which are useless, to be stripped.

Should I write a filter or a different tokenizer? Should the new tokenizer replace the standard one, or can I mix them together? The hardest way would be to take StandardTokenizerImpl, copy it to a new file, and add custom behavior, but I would rather not go too deep into the Lucene implementation for now (I am learning gradually).

Edit: Looking at StandardTokenizerImpl makes me think that if I have to extend it by modifying the actual implementation, that is not so convenient compared to using lex or flex and doing it myself.


This is most easily achieved by pre-processing the text before giving it to Lucene to tokenize. Use an HTML parser such as Jericho to convert your content into text with no HTML, stripping out the tags you don't care about and extracting the text from those you do. Jericho's TextExtractor is perfect for this, and easy to use.

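  import net.htmlparser.jericho.HTMLElementName;
  import net.htmlparser.jericho.Source;
  import net.htmlparser.jericho.StartTag;
  import net.htmlparser.jericho.TextExtractor;

  public class StripHtml {
      public static void main(String[] args) {
          String text = "Pubblichiamo la presentazione di IBM riguardante DB2 per i vari sistemi operativi "
              + "Linux, UNIX e Windows.\r\n\r\nQuesto documento sta sulla piattaforma KM e lo potete "
              + "scaricare a questo <a href=\'https://sfkm.griffon.local/sites/BSF%20KM/BSF/CC%20T/Specifiche/Eventi2008/IBM%20DB2%20for%20Linux,%20UNIX%20e%20Windows.pdf\' target=blank>link</a>.";

          // Extract plain text, dropping every <a> element together with its contents.
          TextExtractor te = new TextExtractor(new Source(text)) {
              @Override
              public boolean excludeElement(StartTag startTag) {
                  // Compare with equals() rather than == so the check does not rely on string identity.
                  return HTMLElementName.A.equals(startTag.getName());
              }
          };
          System.out.println(te.toString());
      }
  }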

This outputs:

  Pubblichiamo la presentazione di IBM riguardante DB2 per i vari sistemi operativi Linux, UNIX e Windows. Questo documento sta sulla piattaforma KM e lo potete scaricare a questo .

You could use a custom Lucene tokenizer with an HTML filter, but that is not the easiest solution; using Jericho will save you development time for this task. The existing HTML analyzers for Lucene probably won't do exactly what you want, because they keep all the text on the page. The only caveat is that you end up processing the text twice instead of as a single stream, but unless you are handling terabytes of data you are not going to care about that cost. Performance is best dealt with once your app has been fleshed out and you have actually identified it as an issue.
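To make the "process the text twice" idea concrete, here is a minimal sketch using only the Java standard library. It is not the Jericho or Lucene API: the crude regex in `stripMarkup` merely stands in for a real HTML parser, and the split in `countTokens` stands in for Lucene's tokenizer. The class name, both method names, the short stop-word list, and the example URL are all hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class TwoPassTokenCount {

    // Hypothetical, deliberately tiny stop-word list for the demo.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("la", "di", "e", "lo", "a", "i", "per"));

    // Pass 1: crude markup removal. This regex is only a stand-in for a real
    // HTML parser; it will mishandle comments, scripts and malformed tags.
    static String stripMarkup(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    // Pass 2: lower-case the text, split on runs of characters that are neither
    // letters nor digits, drop stop words, and count what remains.
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Shortened sample with a placeholder URL (an assumption, not the original link).
        String html = "Questo documento sta sulla piattaforma KM e lo potete scaricare "
                + "a questo <a href='https://example.invalid/doc.pdf' target=blank>link</a>.";
        System.out.println(countTokens(stripMarkup(html)));
    }
}
```

Because the markup is removed before tokenizing, attribute names and URL fragments such as `href` or `20windows.pdf` never reach the counter. Unlike the Jericho example, this sketch keeps the link text itself, since it only removes the tags and not the element contents.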

