Sunday, 15 January 2012

nlp - How to have an ngram tokenizer in Lucene 5.0?


I want to generate character n-grams for a string. Below is the Lucene 4.1 code I have used for it.

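  import java.io.Reader;
  import java.io.StringReader;
  import org.apache.lucene.analysis.ngram.NGramTokenizer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  Reader reader = new StringReader(text);
  // emit contiguous sequences of 3, 4 and 5 characters
  NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 3, 5);
  CharTermAttribute charTermAttribute = gramTokenizer.addAttribute(CharTermAttribute.class);
  while (gramTokenizer.incrementToken()) {
    String token = charTermAttribute.toString();
    System.out.println(token);
  }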

However, I want to use Lucene 5.0.0 to do this. NGramTokenizer in Lucene 5.0.0 has changed a lot from the previous version.

Does anybody know how to use Lucene 5.0.0 to do ngrams?

The following code:

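  import java.io.StringReader;
  import org.apache.lucene.analysis.ngram.NGramTokenizer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  StringReader stringReader = new StringReader("abcd");
  // in 5.0 the constructor no longer takes a Reader; set it afterwards
  NGramTokenizer tokenizer = new NGramTokenizer(1, 2);
  tokenizer.setReader(stringReader);
  tokenizer.reset(); // must be called before the first incrementToken()
  CharTermAttribute termAtt = tokenizer.getAttribute(CharTermAttribute.class);
  while (tokenizer.incrementToken()) {
    String token = termAtt.toString();
    System.out.println(token);
  }
  tokenizer.end();
  tokenizer.close();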

will generate:

  a
  ab
  b
  bc
  c
  cd
  d
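
To see why the grams come out in that order, here is a dependency-free sketch that does not use Lucene at all. It assumes the tokenizer walks the input by start position and, at each position, emits every length from the minimum to the maximum that still fits. The class `NGramSketch` and the method `ngrams` are hypothetical names made up for this illustration; this is not Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: lists character n-grams
// ordered by start position first, then by increasing length.
class NGramSketch {

    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            // at this start position, try every allowed length that still fits
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                grams.add(text.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints a, ab, b, bc, c, cd, d (one per line)
        for (String gram : ngrams("abcd", 1, 2)) {
            System.out.println(gram);
        }
    }
}
```

For the input "abcd" with sizes 1 to 2 this yields a, ab, b, bc, c, cd, d. The sketch works on Java chars and ignores offsets and other token attributes, so treat it only as a way to reason about the ordering.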
