I'm using the following ngram tokenizer to process 15,000 documents (and expect that to grow to as many as a million documents), each with up to 6,000 characters (avg 10