This is a post from the old website.
New post address:
https://www.pisciottablog.com/2018/04/01/web-scraping-with-python/
+++ This wordpress.com domain will no longer be kept up to date +++
+++ Please use the new one (pisciottablog.com) +++
Install Beautiful Soup from terminal:
pip install beautifulsoup4
I wrote some personalized functions using this module. Note: since Beautiful Soup can only work with HTML, we need urllib to open the URL and make it readable, as done in the following code. The first part, before the function definitions, must always be included in every code snippet in this post.
from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen
import re  # this library imports regular expressions

html_doc = "http://example.com/example.html"  # full URL (urlopen needs a scheme)
r = urlopen(html_doc).read()
soup = BeautifulSoup(r, 'html.parser')

def imgs(src_content=""):
    # src values of every <img> whose src matches the given pattern
    my_list = []
    for e in soup.find_all('img', src=re.compile(src_content)):
        my_list.append(e["src"])
    return my_list

def tag_attr(tag, attr_list, noattr_list=""):
    # strings of every `tag` having all attributes in attr_list
    # and none of the attributes in noattr_list
    dict1 = dict.fromkeys(attr_list, True)
    dict2 = dict.fromkeys(noattr_list, False)
    dict3 = dict(dict1, **dict2)
    results = []
    for e in soup.find_all(tag, attrs=dict3):
        results.append(e.string)
    return results

def tag(tag, _dict=""):
    # strings of every `tag` whose attributes match _dict
    results = []
    for e in soup.find_all(tag, attrs=_dict or {}):
        results.append(e.string)
    return results
Explanation of the functions, with examples:
imgs(src_content="") | list of the URLs of imgs whose src matches (not necessarily exactly) the given value. |
tag_attr(tag, attr_list, noattr_list="") | list of the given tag having the attributes in 'attr_list' and not having the attributes in 'noattr_list'. |
l = tag_attr(True, ["class", "label"]) | list of all tags having both the "class" AND "label" attributes. |
tag(tag, _dict="") | list of the given tag having the given attributes with the given values. |
l = tag("div", {"class": ["class1", "class2"]}) | list of divs having "class1" OR "class2" among their classes. |
l = tag("p") | list of every <p> tag. |
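As a quick, self-contained check of the helpers above (the markup here is invented for illustration; in the post the soup comes from the fetched page):

```python
import re
from bs4 import BeautifulSoup

# invented markup, just to exercise the helpers
html = """
<div class="class1">first</div>
<div class="class2">second</div>
<div class="other" label="x">third</div>
<img src="logo.png">
<p>plain</p>
"""
soup = BeautifulSoup(html, "html.parser")

def imgs(src_content=""):
    # src of every <img> whose src matches the pattern
    return [e["src"] for e in soup.find_all("img", src=re.compile(src_content))]

def tag_attr(name, attr_list, noattr_list=()):
    # strings of tags that have every attribute in attr_list
    # and none of those in noattr_list (True/False = presence/absence)
    attrs = dict.fromkeys(attr_list, True)
    attrs.update(dict.fromkeys(noattr_list, False))
    return [e.string for e in soup.find_all(name, attrs=attrs)]

def tag(name, _dict=None):
    # strings of tags whose attributes match _dict
    return [e.string for e in soup.find_all(name, attrs=_dict or {})]

print(imgs("logo"))                                 # ['logo.png']
print(tag("div", {"class": ["class1", "class2"]}))  # ['first', 'second']
print(tag_attr("div", ["class"], ["label"]))        # ['first', 'second']
print(tag("p"))                                     # ['plain']
```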
Now let’s take a look at some functions directly from the module.
In the following code, regular expressions are used:
soup.find_all('div', {"class": [re.compile(".")], "label": [re.compile(".")]})[0].string
It gives the content (.string) of the first ([0]) div whose class and label attribute values match any character (.).
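A minimal, self-contained sketch of that call (the markup is invented; only the first div carries both attributes):

```python
import re
from bs4 import BeautifulSoup

html = '<div class="c1" label="x">hello</div><div class="c2">skip</div>'
soup = BeautifulSoup(html, "html.parser")

# match any non-empty class and label value
text = soup.find_all("div", {"class": [re.compile(".")],
                             "label": [re.compile(".")]})[0].string
print(text)  # hello
```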
soup.findAll("a", text="price")
It finds all link (<a>) tags whose text is exactly "price".
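For example, on an invented snippet with two links, only the one whose text is exactly "price" is returned:

```python
from bs4 import BeautifulSoup

html = '<a href="/buy">price</a><a href="/home">home</a>'
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a", text="price")  # exact string match, not substring
print([a["href"] for a in links])  # ['/buy']
```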
As a last example, assume our HTML is the following:
<a href="address">text</a><span>18A</span>
By executing this code:
l = []
for el in soup.find_all("span", text=re.compile("[0-9]A")):
    el2 = el.previous_element.parent
    l.append(el)
print(l)
Result for el2: the <a> tag, whose href is "address"
Result for el: the <span> containing "18A"
Note: if more spans match, they all accumulate in the "l" list!
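For completeness, here is the last example as a self-contained script (assuming the 18A value is wrapped in a <span>, as the search implies):

```python
import re
from bs4 import BeautifulSoup

html = '<a href="address">text</a><span>18A</span>'
soup = BeautifulSoup(html, "html.parser")

l = []
for el in soup.find_all("span", text=re.compile("[0-9]A")):
    el2 = el.previous_element.parent  # the element just before the span: the <a> tag
    print(el2["href"])  # address
    l.append(el)
print(l)  # [<span>18A</span>]
```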