Splitting a continuous blob of text

There are times when we copy a bunch of text from a web page or an online transcript and the formatting is not copied, so everything arrives in the clipboard as one large continuous text blob without line breaks.

We often need to split a continuous series of sentences into individual lines for better readability, analysis, or individual formatting of each sentence. That’s exactly what we’re going to do in this article using Python.

Consider this paragraph:

The quick brown fox jumps over the lazy dog. This sentence is a classic example used to showcase all the letters of the English alphabet in a single sentence. It is often used by typists and designers to test fonts and layouts. In addition to being a useful tool, this sentence has become quite famous in its own right and is often referenced in popular culture. The world is a vast and diverse place, full of many different cultures, languages, and traditions. From the bustling cities of New York and Tokyo to the remote villages of Africa and South America, there are countless ways of life that exist in our world. Despite these differences, there are also many things that unite us as humans, such as our desire for connection, our ability to feel joy and pain, and our shared experiences of love and loss. It is these commonalities that allow us to bridge the gaps between us and come together as a global community.

To split this contiguous text into individual sentences, we need to do a few things:

Split the contiguous string into individual elements in an array (a list, in Python) using a delimiter character or string, depending on the situation.
Join the individual elements into a string again, this time using a line break as the delimiter in between.
Verify that it worked. If not, tweak the code until it does.
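
The three steps can be sketched on a tiny example before applying them to the full paragraph. The text in blob is made up purely for illustration, and the variable names are placeholders.

```python
# A tiny, made-up blob used only for illustration.
blob = "First sentence. Second sentence. Third sentence."

# Step 1: split on the delimiter (a period followed by a space).
# The delimiter itself is consumed, so all but the last piece lose their period.
parts = blob.split(". ")

# Step 2: join the pieces back together with a line break between them.
joined = "\n".join(parts)

# Step 3: verify the result by looking at it.
print(parts)
print(joined)
```

Printing the intermediate list as well as the final string makes it easy to see what the delimiter removed, which matters for the clean-up steps later on.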

From the given text above, what we want is this:

The quick brown fox jumps over the lazy dog.
This sentence is a classic example used to showcase all the letters of the English alphabet in a single sentence.
It is often used by typists and designers to test fonts and layouts.
In addition to being a useful tool, this sentence has become quite famous in its own right and is often referenced in popular culture.
The world is a vast and diverse place, full of many different cultures, languages, and traditions.
From the bustling cities of New York and Tokyo to the remote villages of Africa and South America, there are countless ways of life that exist in our world.
Despite these differences, there are also many things that unite us as humans, such as our desire for connection, our ability to feel joy and pain, and our shared experiences of love and loss.
It is these commonalities that allow us to bridge the gaps between us and come together as a global community.

Moving to code, suppose we have the following variable:

long_string = '''\
The quick brown fox jumps over the lazy dog. \
This sentence is a classic example used to showcase all the letters of the English alphabet in a single sentence. \
It is often used by typists and designers to test fonts and layouts. In addition to being a useful tool, \
this sentence has become quite famous in its own right and is often referenced in popular culture. \
The world is a vast and diverse place, full of many different cultures, languages, and traditions. \
From the bustling cities of New York and Tokyo to the remote villages of Africa and South America, \
there are countless ways of life that exist in our world. Despite these differences, there are also many things \
that unite us as humans, such as our desire for connection, our ability to feel joy and pain, \
and our shared experiences of love and loss. It is these commonalities that allow us to bridge \
the gaps between us and come together as a global community.
'''

Next, we split it using “. ” as the delimiter, because each sentence ends with a period followed by a space before the next sentence begins. We’ll deal with the last line later.

long_paragraph = long_string.split(". ")

What we get is a list object, which can be verified with the code: type(long_paragraph)

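Now we have an array, or list object, with each sentence as an element. However, the sentences no longer have their periods at the end, except the last one, which keeps its period (and the trailing line break) because no “. ” follows it. We’ll deal with that soon.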
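
One way to check this yourself is to print each element with repr(), which makes hidden characters such as the line break visible. This sketch uses a short stand-in string (sample) instead of long_string so the output stays readable; the stand-in is an assumption for illustration, shaped like the real text.

```python
# Short stand-in for long_string; like the original, it ends with a line break.
sample = "One fish. Two fish. Red fish.\n"

pieces = sample.split(". ")

print(type(pieces))  # <class 'list'>
print(len(pieces))   # 3

for piece in pieces:
    # repr() shows quotes and escape sequences, so a trailing '\n' is visible.
    print(repr(piece))
```

The first two elements print without a period, while the last one prints as 'Red fish.\n', still carrying its period and the line break.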

Next, we need to convert the list items back into one string by joining the elements with line breaks (“\n”).

result = "\n".join(long_paragraph)

This string now has a line break after each sentence, but remember that it no longer has the periods (since we used them as part of the delimiter to split the text blob into sentences earlier)! So, if we want a period at the end of each line, we need to put them back in the right places. We can do that simply by replacing “\n” with “.\n” in the string.

result = result.replace('\n', '.\n')

Now every sentence appears on its own line and ends with a neat period, as follows:

The quick brown fox jumps over the lazy dog.
This sentence is a classic example used to showcase all the letters of the English alphabet in a single sentence.
…(continues)…
It is these commonalities that allow us to bridge the gaps between us and come together as a global community..

But did you notice a problem? While the rest of the lines look great, the last line has an extra period at the end!

Why is that? When we declared the variable long_string above, the text ended with a period followed by an invisible line break (‘\n’). Since no “. ” follows the last sentence, split() left both its period and that line break in the last list element. The replace() call then turned that final ‘\n’ into “.\n”, which added a second period. Depending on how you entered the string value, you may have to tweak the code to fit the situation. The point is to understand which tweak fits which situation.
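
The effect is easy to reproduce on a small scale. The sketch below traces the join and replace steps on a short stand-in string (sample, an illustrative assumption) that ends the same way long_string does.

```python
# Short stand-in for long_string, ending with a period and a line break.
sample = "One fish. Two fish. Red fish.\n"

joined = "\n".join(sample.split(". "))
print(repr(joined))    # 'One fish\nTwo fish\nRed fish.\n'

# Every '\n' gets a period in front of it, including the final one,
# which already had a period before it.
replaced = joined.replace("\n", ".\n")
print(repr(replaced))  # 'One fish.\nTwo fish.\nRed fish..\n'
```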

The fix is not to remove the replace() call, as we absolutely need it, but simply to remove the extra period when the string ends with the pattern “.\n”. That pattern is 2 characters, not 1: ‘.’ is one and ‘\n’ is another. So we need to remove 2 characters from the end of the string in the variable result, but ONLY when the string ends with “.\n”. The endswith() method of the str class handles this check neatly.

if result.endswith(".\n"):
    result = result[:-2]

The final string contained in the result variable now looks perfect:

The quick brown fox jumps over the lazy dog.
This sentence is a classic example used to showcase all the letters of the English alphabet in a single sentence.
It is often used by typists and designers to test fonts and layouts.
In addition to being a useful tool, this sentence has become quite famous in its own right and is often referenced in popular culture.
The world is a vast and diverse place, full of many different cultures, languages, and traditions.
From the bustling cities of New York and Tokyo to the remote villages of Africa and South America, there are countless ways of life that exist in our world.
Despite these differences, there are also many things that unite us as humans, such as our desire for connection, our ability to feel joy and pain, and our shared experiences of love and loss.
It is these commonalities that allow us to bridge the gaps between us and come together as a global community.
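
For reuse, the steps from this article can be gathered into one function. This is only a sketch: split_sentences is a hypothetical name chosen here, and the short blob strings are made-up test inputs.

```python
def split_sentences(text: str) -> str:
    """Put each sentence of `text` on its own line (hypothetical helper)."""
    # Split on a period followed by a space; the delimiter is consumed.
    sentences = text.split(". ")
    # Join with line breaks, then restore the periods lost in the split.
    result = "\n".join(sentences)
    result = result.replace("\n", ".\n")
    # If the input ended with ".\n", the last line now ends with "..\n";
    # dropping the final two characters leaves a single period.
    if result.endswith(".\n"):
        result = result[:-2]
    return result


# Made-up inputs: one with a trailing line break, one without.
blob_with_newline = "One fish. Two fish. Red fish.\n"
blob_without_newline = "One fish. Two fish. Red fish."

print(split_sentences(blob_with_newline))
print(split_sentences(blob_without_newline))
```

Both calls print the same three lines. Keep in mind that splitting on “. ” treats every period followed by a space as a sentence end, so abbreviations such as “Dr. Smith” would also be split, and sentences ending in “?” or “!” would not be.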

Hope you found this article helpful.


