Sunday, 15 March 2015

python - BeautifulSoup special character parsing error


I am using Beautiful Soup and urllib2 to collect content from the internet. This is the code I'm using:

  from bs4 import BeautifulSoup
  import codecs
  import urllib2

  html = urllib2.urlopen('http://plrplr.com/33717/mp3-player-guide/').read()
  soup = BeautifulSoup(html, "lxml")
  content = soup.find('div', {'class': 'entry-content'})

But I am getting results like this ...

  <div class="entry-content"><p>The MP3 player, also known as the Digital Audio Player, has become a key part of our gadget life. Today there are many MP3 players on the market, so which MP3 player is most suitable for you? That's where this MP3 player comes in.<br />Basically, there are 3 types of MP3 players based on capacity: â€"1. Hard drive MP3 player<br />- Highest capacity<br />- Largest in size<br />- Heavy<br />- Often labeled as a "jukebox mp3 player"<br />- Moving parts<br />Example: Apple iPod Video, Sony Network Walkman NW-HD5

The special characters are not handled correctly.

How can I get the exact source markup, like this ...

  
  <p>The MP3 player, also known as a digital audio player, has become a key part of our gadget life. Today there are many brands of MP3 players on the market, so which MP3 player is most suitable for you? That&#8217;s where this MP3 player comes in.</br><br />Basically, there are 3 types of MP3 player: &#8211;</br><br />1. Hard drive MP3 player</br><br />&#8211; Highest capacity</br><br />&#8211; Largest in size</br><br />&#8211; Heavy</br><br />&#8211; Often a &#8220;jukebox mp3 player&#8221;</br><br />&#8211; Moving parts</br><br />&#8211; Example: Apple iPod Video, Sony Network Walkman NW-HD5</br><br />

I'm running this code on a Windows 8 machine, in Eclipse with PyDev.
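The garbled dash in the output above looks like what appears when the UTF-8 bytes of an en dash (the character behind &#8211;) are displayed as Windows-1252. That is only an assumption about the cause, since the post does not confirm the page encoding or the console settings. A minimal sketch of that round trip, using nothing but the standard codecs:

```python
# Assumption: the page is UTF-8 and the text was displayed as Windows-1252.
en_dash = u"\u2013"                     # the character written as &#8211; in the page source

utf8_bytes = en_dash.encode("utf-8")    # three bytes: E2 80 93
garbled = utf8_bytes.decode("cp1252")   # those bytes misread one character per byte

# Reversing the wrong decoding step recovers the original character.
repaired = garbled.encode("cp1252").decode("utf-8")

print(repr(garbled))
print(repaired == en_dash)
```

If this is indeed what happens, the parsed text itself may be fine and only the way it is printed or written out needs an explicit encoding.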

Perhaps what you are looking for is content.prettify(formatter="html"), which outputs HTML entities in place of the non-ASCII characters?

I could not test that on my machine, but it is what the documentation I consulted describes.
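A rough sketch of that suggestion, written for Python 3 rather than the urllib2 setup in the question. The `html` string is a made-up stand-in for the downloaded page, and "html.parser" is used only so the sketch does not depend on lxml. Note that formatter="html" produces named entities; the last step, which is not part of the suggestion above, uses the standard "xmlcharrefreplace" error handler to get numeric references like those in the desired output.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so no network access is needed.
html = u'<div class="entry-content"><p>3 types: \u2013 That\u2019s it</p></div>'

# "lxml", as in the question, should behave the same here if it is installed.
soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", {"class": "entry-content"})

# formatter="html" swaps non-ASCII characters for named HTML entities.
named = content.prettify(formatter="html")

# Numeric character references instead: render the tag as text, then let the
# ASCII codec replace every non-ASCII character with &#NNNN;.
numeric = content.decode().encode("ascii", "xmlcharrefreplace").decode("ascii")

print(named)
print(numeric)
```

Both results contain only ASCII, so they can be printed or written to a file without depending on the console code page.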

