Split a long text into two or more parts, each with a maximum length, in Python
Suppose I have a long text that I want to process with an API that allows a maximum number of characters (N). I would like to split that text into two or more texts, each shorter than N characters, cutting only at a separator. I know I could simply split by the separator, but I would like to keep the number of output sub-texts as small as possible.
For example, suppose my text is:
"Lorem ipsum dolor sit amet, odio salutandi id nam, ferri nostro te duo. Eum ex odio habeo qualisque, ne eos natum graeco. Autem voluptatum ex mea. Nulla putent reformidans cu pro, posse recusabo reprehendunt pro no. An sit ludus oblique. Consulatu cotidieque ex sea, nam no duis prompta expetendis.
Est ne tempor quaestio complectitur, modo error vim et. Option voluptaria efficiantur te eam, ea appareat evertitur qui, te vix pertinax recteque. Mea eu diceret ceteros. Expetenda torquatos assueverit est ex, te reque voluptatibus signiferumque has."
which is 550 characters long. Let's suppose that N is 250. I would expect the text to be split in this way:
Part 1: "Lorem ipsum dolor sit amet, odio salutandi id nam, ferri nostro te duo. Eum ex odio habeo qualisque, ne eos natum graeco. Autem voluptatum ex mea. Nulla putent reformidans cu pro, posse recusabo reprehendunt pro no. An sit ludus oblique" (237 characters)
Part 2: "Consulatu cotidieque ex sea, nam no duis prompta expetendis.
Est ne tempor quaestio complectitur, modo error vim et. Option voluptaria efficiantur te eam, ea appareat evertitur qui, te vix pertinax recteque. Mea eu diceret ceteros." (232 characters)
Part 3: the remaining text.
Any idea on how to do this in Python?
Thank you for any help. Francesca
Solution 1:[1]
n = 250
text = """Lorem ipsum dolor sit amet, odio salutandi id nam, ferri nostro te duo. Eum ex odio habeo qualisque, ne eos natum graeco. Autem voluptatum ex mea. Nulla putent reformidans cu pro, posse recusabo reprehendunt pro no. An sit ludus oblique. Consulatu cotidieque ex sea, nam no duis prompta expetendis.
Est ne tempor quaestio complectitur, modo error vim et. Option voluptaria efficiantur te eam, ea appareat evertitur qui, te vix pertinax recteque. Mea eu diceret ceteros. Expetenda torquatos assueverit est ex, te reque voluptatibus signiferumque has."""
if len(text) > n:
    print(text[:n])
    print(text[n:])
else:
    print(text)
So you can have a variable n holding the maximum length (250 in your example). The code checks whether the text is longer than n characters. If it is, it prints everything from character 0 up to, but not including, index n; slices exclude the end index, so that is exactly the first 250 characters. It then prints the second part, from index n to the end. Note that this cuts at a fixed position and ignores the separator, and the second part can itself be longer than n.
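As a quick check of how Python slicing counts characters, here is a small sketch with a made-up sample string (not the text from the question). The end index of a slice is excluded, so `sample[:n]` holds exactly n characters and `sample[n:]` continues right where it stops.

```python
# Hypothetical sample string, used only to illustrate slice boundaries.
sample = "abcdefghij"
n = 4

head = sample[:n]  # characters at indices 0 .. n-1
tail = sample[n:]  # everything from index n onward

print(len(head), repr(head))
print(head + tail == sample)
```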
Solution 2:[2]
You can create a function that returns chunks of the desired length.
def split(N, text):
    # Step by N so consecutive chunks neither overlap nor skip characters.
    chunks = [text[i:i+N] for i in range(0, len(text), N)]
    return chunks
This returns the chunks as a list, for example:
text = "Lorem.................."  # complete lorem ipsum
chunks = split(250, text)
print(len(chunks[0]), len(chunks[1]), len(chunks[2]))
And the output lengths will be
250 250 50
Solution 3:[3]
This is a possible solution:
def split_txt(txt, sep, n):
    if any(len(s) + 1 > n for s in txt.split(sep)):
        raise Exception('The text cannot be split')
    result = []
    start = 0
    while start + n <= len(txt):
        result.append(txt[start:start + n].rsplit(sep, 1)[0] + sep)
        start += len(result[-1])
    if start < len(txt):
        result.append(txt[start:])
    return result
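For comparison, here is a minimal sketch of the same greedy idea written as piece-by-piece packing, which also handles separators longer than one character. The name `pack_pieces` and the sample string are illustrative assumptions, not part of the answer above. Each part is extended with the next piece for as long as it stays within n characters; since every part is filled as far as possible before a new one is started, this should keep the number of parts as small as possible when cuts are only allowed at the separator.

```python
def pack_pieces(text, sep, n):
    """Greedily pack sep-delimited pieces into parts of at most n characters."""
    pieces = text.split(sep)
    if any(len(piece) > n for piece in pieces):
        raise ValueError("a single piece is longer than n characters")

    parts = []
    current = pieces[0]
    for piece in pieces[1:]:
        candidate = current + sep + piece
        if len(candidate) <= n:
            # The next piece still fits, so keep growing the current part.
            current = candidate
        else:
            # The next piece would overflow: close this part, start a new one.
            parts.append(current)
            current = piece
    parts.append(current)
    return parts


# Hypothetical sample input; the separator is dropped at part boundaries.
sample = "aa. bb. cc. dd"
parts = pack_pieces(sample, ". ", 6)
print(parts)
```

Because the separator is removed only where two parts meet, `sep.join(parts)` reconstructs the original text.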
Solution 4:[4]
You might consider building a subclass of the built-in textwrap.TextWrapper class, using the insights from the other answers. The base class lets you specify rules for handling text: the maximum number of columns (width), the maximum number of lines, the handling of hyphens, and so on.
The Python documentation says: "The textwrap module provides some convenience functions, as well as TextWrapper, the class that does all the work. If you’re just wrapping or filling one or two text strings, the convenience functions should be good enough; otherwise, you should use an instance of TextWrapper for efficiency."
The base class itself does not address the specifics of the OP's problem, but it is worth a look for anyone landing on this page.
The material in this section may also provide some inspiration: https://docs.python.org/3/library/text.html#stringservices
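As a minimal sketch of one of the textwrap convenience functions, `textwrap.wrap` splits a string at whitespace into pieces of at most `width` characters. Keep the assumptions in mind: it breaks at word boundaries rather than at a chosen separator such as a period, and with its default settings it replaces newlines and tabs with spaces, so it only approximates what the question asks for. The sample string and the width of 30 are made up for illustration.

```python
import textwrap

# Hypothetical sample; the question would use the full text and width=250.
sample = "Lorem ipsum dolor sit amet, odio salutandi id nam, ferri nostro te duo."
parts = textwrap.wrap(sample, width=30)

for part in parts:
    print(len(part), repr(part))
```

A TextWrapper subclass would be the place to add separator-aware behavior if word-boundary wrapping is not enough.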
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Henrik |
| Solution 2 | Ahmad Anis |
| Solution 3 | Riccardo Bucco |
| Solution 4 | LoneWanderer |
