'NLP: check if a detected sentence is a complete sentence
In my NLP project I build my own model to identify sentences in a PDF document. Now I would like to check if my extracted sentences are complete sentences. During my research I have already come across this question, with the solutions presented there allowing quite a few false positives. Does anyone perhaps have a tip on how I can check whether a sentence is a complete sentence?
Solution 1:[1]
This is a non-trivial problem, so no approach will work in each and every case. You should also consider that whatever parser you use might merge or split sentences which in the original document were complete sentences, but after they are parsed are not any more.
Generally an alternative to the purely rule-based approaches: you could use a model which was pretrained on the CoLA (Corpus of Linguistic Acceptability) task. These models try to classify sentences/documents into the classes "linguistically acceptable" and "lingustically inacceptable".
On huggingface's model hub there are several pretrained transformer models for this, see for example this inference API for one which is a fine-tuned version of Facebook's RoBERTa model:
You should have a look at how the model was trained when it comes to bullet points/self-standing half sentences etc. though, as some scores might be surprising at first glance.
You might want to combine the models results with a rule-based approach, say for example: "The sentence is acceptable if the score is 0.95 or higher AND the sentence has at least 4 words AND ends with a . ? or !.". You can see what sentences your model + rule-based approach combinations spits out and keep modifying the rules until the results are to your satisfaction.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ewz93 |
