[lug] Python and Unicode - Good Grief! (A Rant)

Jed S. Baer blug at jbaer.cotse.net
Tue Mar 7 12:08:16 MST 2017

Hi Folks.

Yes, this is a rant. Maybe Python doesn't follow Larry Wall's "simple
things should be easy" dictum. Well, it sure isn't helping me 1) get
something done, and 2) learn the language. Behold!

I'm trying to write a web page scraper to produce RSS that I can feed
into Liferea (a news aggregator). Yeah, it'd be nice if the site provided
RSS, but it doesn't, and to call the HTML malformed would be a
compliment. But hey, this should be do-able. So, I grab BeautifulSoup to
see if I can come up with a nice link extractor. The HTML is really bad,
so I think I'll use the "prettify" method to see what it looks like
nested, so I can, I hope, visually identify the child/parent structure.

So, I run my script with output to the terminal. Fine, except that the
indentation ends up wrapping. No biggie, I'll just redirect to a file.

$ ./scrape2rss.py > foo.html
Traceback (most recent call last):
  File "./scrape2rss.py", line 34, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 27: ordinal not in range(128)

Oh, good grief. I guess I must be thinking backwards. If there's going to
be a problem with character set conversion, wouldn't it be when sending
output to a terminal, with all the TERMCAP and locale (LANG, LC_*) stuff
being used? A file? Who cares? Just send the bits through the redirect.

Yes, I know that someplace, in all the classes and methods underlying all
the cool stuff that BeautifulSoup is trying to "help" me with, there's
something I can do to tell it to use a codepage, or UTF-8, or
something. But really. Aaaargh.
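
For the record, here is a minimal sketch of one way around this, assuming
the script runs under Python 2 (the u'\xae' in the traceback suggests so).
When stdout is a terminal, Python 2 takes the encoding from the locale;
when stdout is redirected, sys.stdout.encoding is None and unicode output
falls back to ASCII, hence the error. Encoding explicitly sidesteps that.
The helper name write_utf8 is made up for illustration; it is not part of
BeautifulSoup or the standard library.

```python
# -*- coding: utf-8 -*-
import sys


def write_utf8(text, stream=None):
    """Encode unicode text as UTF-8 and write the raw bytes to stream.

    write_utf8 is a hypothetical helper, not a library function.
    """
    if stream is None:
        # Python 3 exposes the byte stream as sys.stdout.buffer;
        # Python 2's sys.stdout accepts byte strings directly.
        stream = getattr(sys.stdout, "buffer", sys.stdout)
    stream.write(text.encode("utf-8"))


if __name__ == "__main__":
    # Stand-in for the prettified page; u"\xae" is the registered sign
    # that tripped the 'ascii' codec in the traceback.
    write_utf8(u"<p>Some Product\xae</p>\n")
```

Two other routes that should also work, without touching the script much:
setting PYTHONIOENCODING=utf-8 in the environment before running it, or
asking BeautifulSoup for bytes directly with soup.prettify("utf-8").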

(Yes, I know this is really not Python, as such, but something else in
the bowels of some module being pulled in by BeautifulSoup.)

cf. https://www.crummy.com/software/BeautifulSoup/bs4/doc/
