Why does fasttext yield <\s> as first entry in VSM?

I am using a large German corpus, which I have cleaned of all special characters, numbers, and punctuation marks. Each line contains one sentence.

Running

fastText/./fasttext skipgram -input input.txt -output output.txt \
    -minCount 2 -minn 2 -maxn 8 -dim 300 -ws 5

returns a VSM with <\s> as first entry.
As I understand it, some whitespace left in the document is being interpreted as a token.
Is that correct?
And how can I get rid of that whitespace and/or the <\s> entry in the VSM?

Thank you.



Solution 1:[1]

By convention, the fastText tool converts each newline in the input file to the pseudoword token '</s>' (written <\s> in the question), which represents end of sentence ('EOS').

See the discussion in the Python binding Markdown docs:

https://github.com/facebookresearch/fastText/blob/main/python/README.md#important-preprocessing-data--encoding-conventions

The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text if a newline character is encountered. The only exception is if the number of tokens exceeds the MAX_LINE_SIZE constant as defined in the Dictionary header. This means if you have text that is not separated by newlines, such as the fil9 dataset, it will be broken into chunks with MAX_LINE_SIZE of tokens and the EOS token is not appended.

The length of a token is the number of UTF-8 characters by considering the leading two bits of a byte to identify subsequent bytes of a multi-byte sequence. Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the Dictionary header) is considered a character and will not be broken into subwords.

(Though only mentioned in that doc about the Python bindings, it's definitely defined/implemented in the core C++ code, especially the dictionary.cc file.)
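To make the two quoted conventions concrete, here is a rough Python sketch. It is not fastText's actual code: the function names are made up for illustration, the MAX_LINE_SIZE chunking is ignored, and Python's str.split() is only an approximation of fastText's own whitespace handling.

```python
EOS = "</s>"  # the end-of-sentence pseudoword fastText adds to its vocabulary


def tokenize_with_eos(text):
    """Simplified illustration: split on whitespace, emit EOS at each newline.

    Hypothetical helper, not part of fastText; ignores MAX_LINE_SIZE.
    """
    tokens = []
    segments = text.split("\n")
    for i, segment in enumerate(segments):
        tokens.extend(segment.split())
        if i < len(segments) - 1:  # a newline followed this segment
            tokens.append(EOS)
    return tokens


def utf8_char_count(token):
    """Count characters by skipping UTF-8 continuation bytes.

    A continuation byte has its leading two bits set to 10 (0b10xxxxxx),
    so only bytes that do not match that pattern start a new character.
    """
    return sum(1 for b in token.encode("utf-8") if (b & 0xC0) != 0x80)


print(tokenize_with_eos("ein satz\nnoch einer\n"))
print(utf8_char_count("größe"), len("größe".encode("utf-8")))
```

With one sentence per line, every line contributes one EOS token, which is how '</s>' ends up as a regular vocabulary entry. The character count matters for -minn/-maxn: a German word such as "größe" is 5 characters long for subword purposes even though it occupies 7 bytes.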

To eliminate that word-token, you'd have to strip all newlines from your input file.
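One way to strip the newlines is a small preprocessing script such as the sketch below. The function name and the output file name are assumptions for illustration; only input.txt comes from the question. Each line break is replaced with a single space so that adjacent sentences do not get glued together.

```python
def strip_newlines(src_path, dst_path):
    """Copy src_path to dst_path, replacing every line break with a space.

    Hypothetical helper for illustration; streams line by line so a large
    corpus does not have to fit in memory.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line.rstrip("\r\n"))
            dst.write(" ")


# Example usage (file names are placeholders):
# strip_newlines("input.txt", "input_no_newlines.txt")
```

Keep in mind the trade-off described in the quoted docs: without newlines, fastText reads the text in chunks of MAX_LINE_SIZE tokens, so context windows will likely span sentence boundaries instead of stopping at the end of each sentence.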

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 gojomo