I'm using the following ngram tokenizer to process 15,000 documents (which I expect to grow to as many as a million), each with up to 6,000 characters (avg 10