I'm using the following ngram tokenizer to process 15,000 documents (which I expect to grow to as many as a million), each with up to 6,000 characters (avg 10