Preserve newlines in toot corpus

The original code was already trying to do this, but in a way that
Beautiful Soup ended up stripping out. This way preserves the newlines
properly, which will prevent the bot from smooshing together words.
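
To illustrate the word-smooshing problem described above, here is a minimal sketch that uses only the standard library's `html.parser` rather than Beautiful Soup, so it is not the code from this commit. The names `NewlineTextExtractor`, `html_to_text` and `keep_newlines` are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class NewlineTextExtractor(HTMLParser):
    """Collect text from an HTML fragment, emitting "\n" for <br> and </p>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # A <br> marks a line break inside a paragraph.
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        # The end of a paragraph also ends the current line.
        if tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment, keep_newlines=True):
    """Return the text of `fragment`; drop line breaks if keep_newlines is False."""
    parser = NewlineTextExtractor()
    parser.feed(fragment)
    parser.close()
    text = "".join(parser.parts)
    if not keep_newlines:
        # Simulates the failure mode: the break between lines is lost entirely.
        text = text.replace("\n", "")
    return text.strip()


sample = "<p>first line<br>second line</p><p>next paragraph</p>"
print(repr(html_to_text(sample)))
print(repr(html_to_text(sample, keep_newlines=False)))
```

With newlines kept, each line of the toot stays a separate line in the corpus. With them dropped, the last word of one line fuses with the first word of the next, which is the smooshing this commit is meant to prevent.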
Danielle McLean 2021-08-21 15:09:59 +10:00
parent c0f8f1da38
commit f67fbefb5e
Signed by: 00dani
GPG Key ID: 9DDE1EDE01E3A605
1 changed file with 2 additions and 2 deletions

@@ -80,10 +80,10 @@ def extract_toot(toot):
    toot = html.unescape(toot) # convert HTML escape codes to text
    soup = BeautifulSoup(toot, "html.parser")
    for lb in soup.select("br"): # replace <br> with linebreak
        = "\n"
    for p in soup.select("p"): # ditto for <p>
        = "\n"
    for ht in soup.select("a.hashtag"): # convert hashtags from links to text