Thursday, 15 January 2015

python - Beautiful Soup filter function fails to find all rows of a table


I am trying to parse a large HTML document using the Python Beautiful Soup 4 library.

The page has a very large table, structured like this:

  <table summary='foo'>
    <tbody>
      <tr>A bunch of data</tr>
      <tr>More data</tr>
      . . .
      <tr> tags
    </tbody>
  </table>

I have a function that evaluates whether a given tag in soup.descendants is of the kind I am looking for. This matters because the page is large (Beautiful Soup tells me there are about 4000 tags in the document). It looks like this:

  def isrow(tag):
      # A row counts only if it sits inside a <table> that has a summary attribute.
      if tag.name == u'tr':
          if tag.parent.parent.name == u'table' and tag.parent.parent.has_attr('summary'):
              return True

My problem is that when I iterate through soup.descendants, the function only returns the first 77 rows of the table, even though I know there are hundreds of <tr> tags.
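For concreteness, the iteration described above might look roughly like the sketch below. The markup in `sample_html` is a made-up stand-in for the real page, and the names `rows` and `rows_via_find_all` are illustrative. The built-in "html.parser" is named explicitly only so the sketch behaves the same on every machine.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical stand-in for the real page: one table with a summary
# attribute and 200 rows.
sample_html = (
    "<table summary='foo'><tbody>"
    + "".join("<tr><td>row %d</td></tr>" % i for i in range(200))
    + "</tbody></table>"
)

def isrow(tag):
    # Same check as in the question: a <tr> whose grandparent is a
    # <table> that carries a summary attribute.
    if tag.name == u'tr':
        if tag.parent.parent.name == u'table' and tag.parent.parent.has_attr('summary'):
            return True
    return False

soup = BeautifulSoup(sample_html, "html.parser")

# soup.descendants also yields text nodes, which are not tags,
# so the check is restricted to Tag objects.
rows = [node for node in soup.descendants if isinstance(node, Tag) and isrow(node)]

# find_all accepts a function and applies it to every tag in the tree.
rows_via_find_all = soup.find_all(isrow)

print(len(rows), len(rows_via_find_all))
```

On well-formed markup like this, both approaches should agree, so a shortfall on the real page points at how the document was parsed rather than at the filter itself.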

Is this a problem with my function, or is there something I don't understand about how Beautiful Soup generates its collection of descendants? I suspect it may be a Python or bs4 memory issue, but I don't know how to troubleshoot it.

This is still more of an educated guess, but I'll give it a try.

The way Beautiful Soup parses HTML depends heavily on the underlying parser. If you don't specify one explicitly, Beautiful Soup chooses one automatically based on an internal ranking:

If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.

In your case, I'd try switching parsers and see what results you get:

  soup = BeautifulSoup(data, "lxml")         # requires lxml to be installed
  soup = BeautifulSoup(data, "html5lib")     # requires html5lib to be installed
  soup = BeautifulSoup(data, "html.parser")  # uses Python's built-in HTML parser
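To make the comparison less manual, here is a minimal sketch that parses the same markup with each parser and reports how many <tr> tags each one finds. The helper name `count_rows_per_parser` and the sample `data` string are assumptions for illustration; in practice `data` would be the real page. Parsers that are not installed are reported as None rather than treated as an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def count_rows_per_parser(data, parsers=("lxml", "html5lib", "html.parser")):
    """Return {parser name: number of <tr> tags found}, or None if not installed."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(data, name)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is missing.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all("tr"))
    return counts

# Hypothetical stand-in for the real document.
data = "<table summary='foo'><tbody>" + "<tr><td>x</td></tr>" * 300 + "</tbody></table>"

counts = count_rows_per_parser(data)
for name, n in counts.items():
    print(name, "not installed" if n is None else n)
```

If the counts differ on the real page, the markup is probably malformed somewhere around the point where the lower count stops, and the parsers are recovering from it in different ways.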

