DHara: python - How to scrape between span tags using beautifulsoup -

Wednesday, 15 May 2013

python - How to scrape between span tags using beautifulsoup

I am trying to scrape some text using Python and Beautiful Soup. I run this code:

    soup.find_all("span")[0]

It gets me:

    <span style="margin-right: 0.9em">TEXT1<span style="color: #111111; margin-left: 0.2em">TEXT2</span>TEXT3</span>

This is great, but what I actually want is to extract TEXT1, TEXT2, and TEXT3.

I do not know how to do this. If I do:

    soup.find_all("span")[0].find_all("span")

I only get:

    <span style="color: #111111; margin-left: 0.2em">TEXT2</span>

I think this is because the outer <span> contains both loose text and a nested <span> ... </span> pair. How do I pick out TEXT1, TEXT2, and TEXT3?

With a little formatting, we can see what kind of structure you have:

    <span style="margin-right: 0.9em">
        TEXT1
        <span style="color: #111111; margin-left: 0.2em">
            TEXT2
        </span>
        TEXT3
    </span>

So, unfortunately, we cannot use a simple one-level approach, because the NavigableString elements we want sit at different depths of the tree.
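To make the nesting concrete, here is a small sketch (written for Python 3 and bs4, with the explicit "html.parser" argument and the variable names being my own assumptions rather than part of the original answer) that lists the direct children of the outer span. It shows that TEXT1 and TEXT3 are text nodes of the outer tag, while TEXT2 is hidden one level deeper inside a nested Tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ('<span style="margin-right: 0.9em">TEXT1'
        '<span style="color: #111111; margin-left: 0.2em">TEXT2</span>'
        'TEXT3</span>')

# "html.parser" is the parser bundled with Python; chosen here as an assumption.
soup = BeautifulSoup(html, "html.parser")
outer = soup.find_all("span")[0]

# Record the class name and text of each direct child of the outer span.
kinds = [
    (type(child).__name__,
     child.get_text() if isinstance(child, Tag) else str(child))
    for child in outer.children
]

for kind, text in kinds:
    print(kind, repr(text))
# NavigableString 'TEXT1'
# Tag 'TEXT2'
# NavigableString 'TEXT3'
```

This is also why find_all("span") on the outer tag returns only the inner span: find_all matches tags, not the text nodes around them.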
One way to do this is to write a function that checks the children of a given element (called context in the functions below) and, if they are NavigableString elements, collects them. If a child is a Tag, the function calls itself on that tag and adds whatever it finds.

Consider:

    from bs4.element import NavigableString, Tag

    def extractNavigableStrings(context):
        strings = []
        for e in context.children:
            # Text node: keep it.
            if isinstance(e, NavigableString):
                strings.append(e)
            # Nested tag: recurse and merge its strings into ours.
            if isinstance(e, Tag):
                strings.extend(extractNavigableStrings(e))
        return strings

Here is how we can run it on your input:

    from bs4 import BeautifulSoup
    from bs4.element import NavigableString, Tag

    def extractNavigableStrings(context):
        strings = []
        for e in context.children:
            if isinstance(e, NavigableString):
                strings.append(e)
            if isinstance(e, Tag):
                strings.extend(extractNavigableStrings(e))
        return strings

    soup = BeautifulSoup('''<span style="margin-right: 0.9em">TEXT1<span style="color: #111111; margin-left: 0.2em">TEXT2</span>TEXT3</span>''')

    print(extractNavigableStrings(soup))

Under Python 2, printing the function's result shows our list of NavigableString objects:

    [u'TEXT1', u'TEXT2', u'TEXT3']
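As an alternative to the hand-written recursion, Beautiful Soup's own helpers can walk the tree for you. The sketch below (Python 3; the variable names and the "html.parser" argument are my assumptions, not part of the original answer) uses the .strings and .stripped_strings generators and get_text() on the same input.

```python
from bs4 import BeautifulSoup

html = ('<span style="margin-right: 0.9em">TEXT1'
        '<span style="color: #111111; margin-left: 0.2em">TEXT2</span>'
        'TEXT3</span>')
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("span")

# .strings yields every text node under the tag, at any depth.
all_strings = list(outer.strings)

# .stripped_strings does the same but trims whitespace and skips empty nodes.
stripped = list(outer.stripped_strings)

# get_text() joins all text nodes into one string, here separated by spaces.
joined = outer.get_text(separator=" ")

print(all_strings)  # ['TEXT1', 'TEXT2', 'TEXT3']
print(stripped)     # ['TEXT1', 'TEXT2', 'TEXT3']
print(joined)       # TEXT1 TEXT2 TEXT3
```

The custom function is still useful when you need to filter or transform nodes during the walk; the built-ins are shorter when you just want the text.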

Note that the elements of the list returned by extractNavigableStrings are not Python strings; they are NavigableString objects. It is fine to work with them as they are, but if you want plain string content, call unicode(<element>) on each one (on Python 3, str(<element>)). For example:

    # Python 2: uses unicode() and the print statement.
    nss = extractNavigableStrings(soup)
    strings = [unicode(ns) for ns in nss]
    print strings
    # [u'TEXT1', u'TEXT2', u'TEXT3']
    for s in strings:
        print type(s), s
    # <type 'unicode'> TEXT1
    # <type 'unicode'> TEXT2
    # <type 'unicode'> TEXT3
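For readers on Python 3, the same conversion might look like the sketch below. The snake_case function name, the elif, and the "html.parser" argument are my own choices for illustration; the logic follows the recursive function from the answer, with str() taking the place of unicode().

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def extract_navigable_strings(context):
    """Collect every NavigableString under context, depth first."""
    strings = []
    for e in context.children:
        if isinstance(e, NavigableString):
            strings.append(e)
        elif isinstance(e, Tag):
            strings.extend(extract_navigable_strings(e))
    return strings


html = ('<span style="margin-right: 0.9em">TEXT1'
        '<span style="color: #111111; margin-left: 0.2em">TEXT2</span>'
        'TEXT3</span>')
soup = BeautifulSoup(html, "html.parser")

nss = extract_navigable_strings(soup)

# str() turns each NavigableString into a plain Python 3 string.
plain = [str(ns) for ns in nss]

for s in plain:
    print(type(s).__name__, s)
```

Converting to plain strings also drops the reference each NavigableString keeps to the parse tree, which can matter if you hold on to the results after you are done with the soup.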




Posted by Unknown at 03:22










