Adding a Space Using Join and List Comprehension

I am trying to parse some html using BeautifulSoup. The page is behind a paywall so I cannot link to it. However, I don't think the html is causing the issue.

When I parse the html using the line below, the text in the div "place-names" does not contain a space, so it looks like this:

LondonParis instead of London Paris

'\n'.join([item.text for item in items.find_all(["div"], {"class": ["svvf", "place-names"]})]).strip()

To solve this I thought the following would work:

'\n'.join([' '.join(y) for y in [item.text for item in items.find_all(["div"], {"class": ["svvf", "place-names"]})]]).strip()

But this returns:

L o n d o n P a r i s

I'm still learning Python and can't work out what I am doing wrong. Can anyone help?



Solution 1:[1]

I suspect your problem lies in ' '.join(y) for y....

You're calling join on individual strings. Strings are iterable, so join treats each character in the string as a separate item to join.

That results in the code putting a space between each of the letters in the string (i.e., "London" becomes "L o n d o n", which is then concatenated with "P a r i s", etc.).
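To check this behaviour in isolation, here is a minimal sketch in plain Python with no BeautifulSoup involved. The variable names are illustrative and not from the original post.

```python
word = "London"

# A str is iterable: iterating over it yields one-character strings.
chars = list(word)

# join accepts any iterable of strings, so a bare str is joined
# character by character.
spaced_chars = " ".join(word)

# Passing a list of whole strings joins the strings themselves.
spaced_words = " ".join(["London", "Paris"])

print(chars)         # ['L', 'o', 'n', 'd', 'o', 'n']
print(spaced_chars)  # L o n d o n
print(spaced_words)  # London Paris
```

The only difference between the last two calls is what gets iterated: the characters of one string, or the elements of a list.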

If you take out the extra list comprehension and call the space-join on the list of text items, it should work:

# Get items
item_list = items.find_all(["div"], {"class": ["svvf", "place-names"]})

# Get text from each item
item_text_list = [item.text for item in item_list]

# Call join on the item_text_list, not each individual object within that list
item_text_joined = ' '.join(item_text_list).strip()

Solution 2:[2]

Look at just one iteration. When you use ' '.join(y), it's like doing ' '.join('London').

This takes each character and joins them with ' ':

' '.join('London')
Out[38]: 'L o n d o n'

So you are essentially looping twice. The list comprehension creates the list ['London', 'Paris'], and then you do ' '.join('London') and ' '.join('Paris').
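The two loops can be traced step by step. This is a sketch that assumes the parsed divs yield the texts 'London' and 'Paris'; the list `texts` is a hypothetical stand-in for the result of the `find_all` comprehension.

```python
# Hypothetical stand-in for [item.text for item in items.find_all(...)]
texts = ["London", "Paris"]

# Inner comprehension from the question: each y is a whole string,
# so ' '.join(y) puts a space between its characters.
spaced = [" ".join(y) for y in texts]

# The outer join then glues the already spaced-out strings with newlines.
result = "\n".join(spaced).strip()

# Joining the list directly keeps each word intact.
fixed = " ".join(texts).strip()

print(spaced)  # ['L o n d o n', 'P a r i s']
print(fixed)   # London Paris
```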

Your first snippet is actually what you want:

'\n'.join([item.text for item in items.find_all(["div"], {"class": ["svvf", "place-names"]})]).strip()

But if you want a space instead of a new line, change it to:

' '.join([item.text for item in items.find_all(["div"], {"class": ["svvf", "place-names"]})]).strip()

Look at the code and see the difference:

html = '''<div class="svvf">London</div>
<div class="place-names">Paris</div>'''


from bs4 import BeautifulSoup

items = BeautifulSoup(html, 'html.parser')

print('With new line:')
print('\n'.join([item.text for item in items.find_all(["div"], {"class": ["svvf", "place-names"]})]).strip())

print('')

print('With the space:')
print(' '.join([item.text for item in items.find_all(["div"], {"class": ["svvf", "place-names"]})]).strip())

Output:

With new line:
London
Paris

With the space:
London Paris

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jeff Martin
Solution 2 HedgeHog