I am trying to scrape text using Python and Beautiful Soup. I run this code:
    soup.find_all("span")[0]
It gives me:
    <span style="margin-right: 0.9em">TEXT1<span style="color: #111111; margin-left: 0.2em">TEXT2</span>TEXT3</span>
This is great, but the problem is that I want to extract TEXT1, TEXT2, and TEXT3.
I do not know how to do this. If I run soup.find_all("span")[0].find_all("span"), I only get:
    <span style="color: #111111; margin-left: 0.2em">TEXT2</span>
I think this is because TEXT2 is the only text fully enclosed between this particular <span> and </span>. How do I pick out TEXT1, TEXT2, and TEXT3?
With a little formatting, we can see what kind of structure you have:
    <span style="margin-right: 0.9em">
      TEXT1
      <span style="color: #111111; margin-left: 0.2em">
        TEXT2
      </span>
      TEXT3
    </span>
Unfortunately, a single find_all call on the tags will not do here, because we want to reach NavigableString elements that sit at several different depths.
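To make the "different depths" point concrete, here is a small Python 3 sketch that reports how far below the outer span each string sits. The helper name depth_below and the variable names are made up for illustration, and the "html.parser" backend is an assumption; any parser should give the same tree for this input.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = ('<span style="margin-right: 0.9em">TEXT1'
        '<span style="color: #111111; margin-left: 0.2em">TEXT2</span>'
        'TEXT3</span>')
soup = BeautifulSoup(html, "html.parser")
outer = soup.find_all("span")[0]

def depth_below(node, root):
    """Count the parent hops from node up to root (1 = direct child)."""
    depth = 0
    while node is not root:
        node = node.parent
        depth += 1
    return depth

# .descendants walks every nested node, tags and strings alike.
depths = [(str(node), depth_below(node, outer))
          for node in outer.descendants
          if isinstance(node, NavigableString)]
print(depths)
```

TEXT1 and TEXT3 are direct children of the outer span, while TEXT2 is one level further down, which is why looking only at one level misses some of the text.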
One way to do this is to write a function that walks the children of a given element (called context in the function below): if a child is a NavigableString, it is collected; if it is a Tag, the function recurses into it.
Consider:
    from bs4.element import NavigableString, Tag

    def extractNavigableStrings(context):
        strings = []
        for e in context.children:
            # Text nodes are collected directly.
            if isinstance(e, NavigableString):
                strings.append(e)
            # Nested tags are searched recursively.
            if isinstance(e, Tag):
                strings.extend(extractNavigableStrings(e))
        return strings
Here is how we can run it on your input:
    from bs4 import BeautifulSoup
    from bs4.element import NavigableString, Tag

    def extractNavigableStrings(context):
        strings = []
        for e in context.children:
            if isinstance(e, NavigableString):
                strings.append(e)
            if isinstance(e, Tag):
                strings.extend(extractNavigableStrings(e))
        return strings

    soup = BeautifulSoup('''<span style="margin-right: 0.9em">TEXT1<span style="color: #111111; margin-left: 0.2em">TEXT2</span>TEXT3</span>''', 'html.parser')

    print(extractNavigableStrings(soup))
The function returns our list of NavigableStrings (output as printed by Python 2):
    [u'TEXT1', u'TEXT2', u'TEXT3']
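As a possible shortcut, Beautiful Soup also ships the .strings and .stripped_strings generators, which walk all nested text nodes for you. The following Python 3 sketch applies them to the same input; the variable names and the "html.parser" choice are assumptions for illustration, and for this input the result should match what the hand-written function collects.

```python
from bs4 import BeautifulSoup

html = ('<span style="margin-right: 0.9em">TEXT1'
        '<span style="color: #111111; margin-left: 0.2em">TEXT2</span>'
        'TEXT3</span>')
soup = BeautifulSoup(html, "html.parser")

# .strings yields every nested text node in document order.
all_strings = list(soup.strings)

# .stripped_strings does the same but trims whitespace and skips
# strings that are empty after trimming.
stripped = list(soup.stripped_strings)

print(all_strings)
print(stripped)
```

The recursive function is still worth having when you need to filter or transform nodes while walking the tree.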
Note that the elements in the returned list are not plain Python strings; they are NavigableString elements. It is fine to work with them as they are, but if you want the plain string content, call unicode(<element>) on each one (str(<element>) on Python 3). For example, in Python 2:
    nss = extractNavigableStrings(soup)
    strings = [unicode(ns) for ns in nss]
    print strings
    # [u'TEXT1', u'TEXT2', u'TEXT3']
    for s in strings:
        print type(s), s
    # <type 'unicode'> TEXT1
    # <type 'unicode'> TEXT2
    # <type 'unicode'> TEXT3
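For readers on Python 3, where unicode() no longer exists, the same conversion can be sketched with str(). The variable names are illustrative, and this sketch gathers the strings via the built-in .strings generator rather than the function above.

```python
from bs4 import BeautifulSoup

html = ('<span style="margin-right: 0.9em">TEXT1'
        '<span style="color: #111111; margin-left: 0.2em">TEXT2</span>'
        'TEXT3</span>')
soup = BeautifulSoup(html, "html.parser")

nss = list(soup.strings)            # NavigableString objects
plain = [str(ns) for ns in nss]     # plain Python str copies

for ns, s in zip(nss, plain):
    # Show the type before and after conversion, then the text.
    print(type(ns).__name__, type(s).__name__, s)
```

Converting is mostly useful when the strings outlive the soup: a NavigableString keeps a reference back to the parse tree, whereas the plain copy does not.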