I'm using the following ngram tokenizer to process 15,000 documents (and expect that to grow to as many as a million documents), each with up to 6,000 characters (avg 10