I'm using the following n-gram tokenizer to process 15,000 documents (and expect that to grow to as many as a million), each with up to 6,000 characters (avg 10