Wednesday 15 May 2013

python - How to scrape between span tags using beautifulsoup


I am trying to scrape text using Python and Beautiful Soup. I run this code:

  soup.find_all("span")[0]

It gets me:

  <span style="margin-right: 0.9em">TEXT1<span style="color: #111111; margin-left: 0.2em">TEXT2</span>TEXT3</span>

This is great, but the problem is that I want to extract TEXT1, TEXT2, and TEXT3.

I do not know how to do this. If I do:

  soup.find_all("span")[0].find_all("span")

I only get:

  <span style="color: #111111; margin-left: 0.2em">TEXT2</span>

I think this is because the outer <span> contains another <span> and </span> pair inside it. How do I pick out TEXT1, TEXT2, and TEXT3?

With a little formatting, we can see what kind of structure you have:

  <span style="margin-right: 0.9em">
    TEXT1
    <span style="color: #111111; margin-left: 0.2em">TEXT2</span>
    TEXT3
  </span>

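So, unfortunately, a single find_all lookup is not enough here, because the NavigableString elements we want sit at different depths: TEXT1 and TEXT3 are direct children of the outer span, while TEXT2 is nested one level deeper.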
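To make the depth issue concrete, here is a minimal sketch that lists the direct children of the outer span. It assumes Python 3 and the `html.parser` backend, neither of which is specified in the question; the names `html`, `outer`, and `child_kinds` are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = ('<span style="margin-right: 0.9em">TEXT1'
        '<span style="color: #111111; margin-left: 0.2em">TEXT2</span>'
        'TEXT3</span>')

# Assumption: the stdlib-backed parser; other backends may add wrapper tags.
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("span")

# Classify each direct child of the outer span as a text node or a nested tag.
child_kinds = []
for child in outer.children:
    if isinstance(child, NavigableString):
        child_kinds.append(("text", str(child)))
    elif isinstance(child, Tag):
        child_kinds.append(("tag", child.name))

print(child_kinds)
# [('text', 'TEXT1'), ('tag', 'span'), ('text', 'TEXT3')]
```

TEXT2 does not appear among the direct children; it lives inside the nested tag, which is why some form of descent is needed.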

One way to do this is to write a function that checks the children of a given element (called context in the function below) and, if they are NavigableString elements, collects them. If a child is itself a tag, the function recurses into it.

Consider:

  from bs4.element import NavigableString, Tag

  def extractNavigableStrings(context):
      strings = []
      for e in context.children:
          if isinstance(e, NavigableString):
              # A text node: collect it as-is.
              strings.append(e)
          elif isinstance(e, Tag):
              # A nested tag: collect the text nodes beneath it.
              strings.extend(extractNavigableStrings(e))
      return strings

Here is how we can run it on your input:

  from bs4 import BeautifulSoup
  from bs4.element import NavigableString, Tag

  def extractNavigableStrings(context):
      strings = []
      for e in context.children:
          if isinstance(e, NavigableString):
              strings.append(e)
          elif isinstance(e, Tag):
              strings.extend(extractNavigableStrings(e))
      return strings

  soup = BeautifulSoup('''<span style="margin-right: 0.9em">TEXT1<span style="color: #111111; margin-left: 0.2em">TEXT2</span>TEXT3</span>''')
  # Python 2 print statement
  print extractNavigableStrings(soup)

This prints our list of NavigableString elements:

  [u'TEXT1', u'TEXT2', u'TEXT3']
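Beautiful Soup also ships a built-in generator, `.strings`, that walks text nodes at every depth, so it can serve as a cross-check for a hand-written recursive collector. The following is a minimal sketch, assuming Python 3 and the `html.parser` backend (neither is specified in the answer); the names `html`, `outer`, `texts`, and `joined` are illustrative.

```python
from bs4 import BeautifulSoup

html = ('<span style="margin-right: 0.9em">TEXT1'
        '<span style="color: #111111; margin-left: 0.2em">TEXT2</span>'
        'TEXT3</span>')

# Assumption: the stdlib-backed parser.
soup = BeautifulSoup(html, "html.parser")
outer = soup.find_all("span")[0]

# .strings yields every text node below the tag, in document order.
texts = list(outer.strings)
print(texts)
# ['TEXT1', 'TEXT2', 'TEXT3']

# get_text() joins the same text nodes, here with a space as separator.
joined = outer.get_text(" ")
print(joined)
# TEXT1 TEXT2 TEXT3
```

If the markup contains whitespace-only text nodes (for example, from pretty-printed HTML), `.stripped_strings` may be the more convenient variant, since it skips them and strips the rest.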

Note that the elements in the returned list are not plain Python strings; they are NavigableString elements. It is fine to work with them as they are, but if you want just the string content, call unicode(<element>) on each one. For example:

  nss = extractNavigableStrings(soup)
  strings = [unicode(ns) for ns in nss]
  print strings
  # [u'TEXT1', u'TEXT2', u'TEXT3']
  for s in strings:
      print type(s), s
  # <type 'unicode'> TEXT1
  # <type 'unicode'> TEXT2
  # <type 'unicode'> TEXT3
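On Python 3 there is no `unicode` built-in; `str` plays that role. The sketch below shows the same conversion under that assumption, and also why the distinction matters: a NavigableString still knows its place in the tree, while the converted string does not. The `html.parser` backend and the names `nss` and `plain` are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = ('<span style="margin-right: 0.9em">TEXT1'
        '<span style="color: #111111; margin-left: 0.2em">TEXT2</span>'
        'TEXT3</span>')
soup = BeautifulSoup(html, "html.parser")

# Collect the text nodes; each one is a NavigableString tied to the tree.
nss = list(soup.find("span").strings)

# str() produces plain strings detached from the parse tree.
plain = [str(ns) for ns in nss]

for ns, s in zip(nss, plain):
    # NavigableString keeps tree links such as .parent; plain str has none.
    print(type(s).__name__, s, "| parent tag:", ns.parent.name)
```

Converting to plain strings is usually worthwhile when the values will outlive the parse tree, since a NavigableString keeps a reference to its parent.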
