Python 3.5 BeautifulSoup4: get the text from 'p' in a div

I'm trying to extract all the text from the div with class 'caselawcontent searchable-content'. This code prints only the HTML, not the text from the web page. What am I missing to get the text?

The following link is in the file 'filteredcasesdoc.txt':
http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html

    import requests
    from bs4 import BeautifulSoup

    with open('filteredcasesdoc.txt', 'r') as openfile1:
        for line in openfile1:
            rulingpage = requests.get(line).text
            soup = BeautifulSoup(rulingpage, 'html.parser')
            doctext = soup.find('div', class_='caselawcontent searchable-content')
            print(doctext)

    from bs4 import BeautifulSoup
    import requests

    url = 'http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')

I've used a much more reliable form of .find, passing the attributes as a dictionary (key: value):

    whole_section = soup.find('div', {'class': 'caselawcontent searchable-content'})

    the_title = whole_section.center.h2
    # e.g. Missouri Court of Appeals, Southern District, Division Two.
    second_title = whole_section.center.h3.p
    # e.g. STATE of Missouri, Plaintiff-Appellant v....
    number_text = whole_section.center.h3.next_sibling.next_sibling
    #eg
    the_date = number_text.next_sibling.next_sibling

    # authors
    authors = whole_section.center.next_sibling

    para = whole_section.findAll('p')[1:]
    # Because we don't want the paragraph h3.p.
    # We could also do findAll('p', recursive=False), which doesn't pick up children.

Basically, I have dissected the whole tree. As for the paragraphs (i.e. the main text, the variable para), you will have to loop over them.

    print(authors)

    # and you can add .text (e.g. print(authors.text)) to get the text without the tag.
    # or a simple function that returns only the text
    def rettext(something):
        return something.text

    # Usage: print(rettext(authors))
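The loop over the paragraphs mentioned above could look roughly like the sketch below. It runs against a small inline HTML string rather than the live page: SAMPLE_HTML is a hypothetical, simplified stand-in for the real markup, and the names paragraph_texts and direct_texts are made up for illustration. It also shows the recursive=False variant, which only looks at direct children of the div.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the real page's markup (an assumption,
# not a copy of the actual page).
SAMPLE_HTML = """
<div class="caselawcontent searchable-content">
<center><h2>Court name</h2><h3><p>Case title</p></h3></center>
<p>First paragraph.</p>
<p>Second <i>paragraph</i>.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
whole_section = soup.find("div", {"class": "caselawcontent searchable-content"})

# Skip the first <p>, which is the case title nested inside <h3>.
para = whole_section.find_all("p")[1:]
paragraph_texts = [p.get_text().strip() for p in para]

# Alternative: only <p> tags that are direct children of the div,
# so the <p> nested inside <h3> is never picked up.
direct_texts = [
    p.get_text().strip()
    for p in whole_section.find_all("p", recursive=False)
]

for text in paragraph_texts:
    print(text)
```

On the real page the nesting may differ, so it is worth checking which of the two variants returns the paragraphs you actually want.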

Try printing doctext.text. This will strip all the HTML tags for you.

    import requests
    from bs4 import BeautifulSoup

    cases = []

    with open('filteredcasesdoc.txt', 'r') as openfile1:
        for url in openfile1:
            # GET the HTML page as a string, with HTML tags
            # (strip the trailing newline from the line first)
            rulingpage = requests.get(url.strip()).text

            soup = BeautifulSoup(rulingpage, 'html.parser')
            # find the part of the HTML page we want, as an HTML element
            doctext = soup.find('div', class_='caselawcontent searchable-content')
            print(doctext.text)
            # now we have the element's text as a string, without HTML tags
            cases.append(doctext.text)
            # do something useful with this!
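One caveat: soup.find returns None when no matching element exists, and calling .text on None raises an AttributeError. A minimal sketch of a guard follows. The helper name extract_case_text is hypothetical, and the inline HTML strings are made-up test inputs rather than the real page. It uses get_text with a separator so that paragraphs do not run together.

```python
from bs4 import BeautifulSoup


def extract_case_text(html):
    """Return the text of the case div, or None if the div is missing.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    doctext = soup.find("div", class_="caselawcontent searchable-content")
    if doctext is None:
        return None
    # separator="\n" puts each text fragment on its own line;
    # strip=True trims whitespace and drops empty fragments.
    return doctext.get_text(separator="\n", strip=True)


# Made-up inputs, not the real page.
found = extract_case_text(
    '<div class="caselawcontent searchable-content"><p>Alpha</p><p>Beta</p></div>'
)
missing = extract_case_text("<div class='other'><p>Nothing here</p></div>")

print(found)
print(missing)
```

In the loop above, a None result could simply be skipped instead of appended to cases.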