I'm using the following ngram tokenizer to process 15,000 documents (a number I expect to grow to as many as a million), each with up to 6,000 characters (avg 10