This is a post from the old website.
New post address:
https://www.pisciottablog.com/2018/04/01/web-scraping-with-python/
+++ This wordpress.com domain will no longer be kept up to date +++
+++ Please use the new one (pisciottablog.com) +++
Install Beautiful Soup from terminal:
pip install beautifulsoup4
I wrote some personalized functions using this module. Note: since Beautiful Soup can only work with HTML, we need urllib to open the URL and make it readable, as done in the following code. The first part, before the function definitions, must always be included in every code snippet in this post.
from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen
import re  # this library imports regular expressions

html_doc = "http://example.com/example.html"  # full URL (urlopen needs a scheme)
r = urlopen(html_doc).read()
soup = BeautifulSoup(r, 'html.parser')

def imgs(src_content=""):
    # src values of every <img> whose src matches the given pattern
    my_list = []
    for e in soup.find_all('img', src=re.compile(src_content)):
        my_list.append(e["src"])
    return my_list

def tag_attr(tag, attr_list, noattr_list=""):
    # strings of every `tag` having all attributes in attr_list
    # and none of the attributes in noattr_list
    dict1 = dict.fromkeys(attr_list, True)
    dict2 = dict.fromkeys(noattr_list, False)
    dict3 = dict(dict1, **dict2)
    results = []
    for e in soup.find_all(tag, attrs=dict3):
        results.append(e.string)
    return results

def tag(tag, _dict=""):
    # strings of every `tag` whose attributes match _dict
    results = []
    for e in soup.find_all(tag, attrs=_dict or {}):
        results.append(e.string)
    return results
Explanation of the functions, with examples:
imgs(src_content="") | list of the URLs of imgs whose src matches (not necessarily exactly) the given value. |
tag_attr(tag, attr_list, noattr_list="") | list of the given tag having the attributes in 'attr_list' and not having the attributes in 'noattr_list'. |
l = tag_attr(True, ["class", "label"]) | list of all tags having both the "class" AND "label" attributes. |
tag(tag, _dict="") | list of the given tag having the given attributes with the given values. |
l = tag("div", {"class": ["class1", "class2"]}) | list of divs having "class1" OR "class2" among their classes. |
l = tag("p") | list of every <p> tag. |
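As a quick, self-contained check of the helpers above (the markup here is invented for illustration; in the post the soup comes from the fetched page):

```python
import re
from bs4 import BeautifulSoup

# invented markup, just to exercise the helpers
html = """
<div class="class1">first</div>
<div class="class2">second</div>
<div class="other" label="x">third</div>
<img src="logo.png">
<p>plain</p>
"""
soup = BeautifulSoup(html, "html.parser")

def imgs(src_content=""):
    # src of every <img> whose src matches the pattern
    return [e["src"] for e in soup.find_all("img", src=re.compile(src_content))]

def tag_attr(name, attr_list, noattr_list=()):
    # strings of tags that have every attribute in attr_list
    # and none of those in noattr_list (True/False = presence/absence)
    attrs = dict.fromkeys(attr_list, True)
    attrs.update(dict.fromkeys(noattr_list, False))
    return [e.string for e in soup.find_all(name, attrs=attrs)]

def tag(name, _dict=None):
    # strings of tags whose attributes match _dict
    return [e.string for e in soup.find_all(name, attrs=_dict or {})]

print(imgs("logo"))                                 # ['logo.png']
print(tag("div", {"class": ["class1", "class2"]}))  # ['first', 'second']
print(tag_attr("div", ["class"], ["label"]))        # ['first', 'second']
print(tag("p"))                                     # ['plain']
```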
Now let’s take a look at some functions directly from the module.
In the following code, regular expressions are used:
soup.find_all('div', {"class": [re.compile(".")], "label": [re.compile(".")]})[0].string
It gives the content (.string) of the first ([0]) div whose class and label attribute values match any character (.).
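A minimal, self-contained sketch of that call (the markup is invented; only the first div carries both attributes):

```python
import re
from bs4 import BeautifulSoup

html = '<div class="c1" label="x">hello</div><div class="c2">skip</div>'
soup = BeautifulSoup(html, "html.parser")

# match any non-empty class and label value
text = soup.find_all("div", {"class": [re.compile(".")],
                             "label": [re.compile(".")]})[0].string
print(text)  # hello
```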
soup.findAll("a", text="price")
It finds all link (<a>) tags whose text is exactly "price".
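For example, on an invented snippet with two links, only the one whose text is exactly "price" is returned:

```python
from bs4 import BeautifulSoup

html = '<a href="/buy">price</a><a href="/home">home</a>'
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a", text="price")  # exact string match, not substring
print([a["href"] for a in links])  # ['/buy']
```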
As a last example, assume our HTML is the following:
<a href="address">text</a><span>18A</span>
By executing this code:
l = []
for el in soup.find_all("span", text=re.compile("[0-9]A")):
    el2 = el.previous_element.parent
    l.append(el)
print(l)
Result for el2: the <a> tag, whose href is "address"
Result for el: the <span> containing "18A"
Note: if more spans match, they all accumulate in the "l" list!
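For completeness, here is the last example as a self-contained script (assuming the 18A value is wrapped in a <span>, as the search implies):

```python
import re
from bs4 import BeautifulSoup

html = '<a href="address">text</a><span>18A</span>'
soup = BeautifulSoup(html, "html.parser")

l = []
for el in soup.find_all("span", text=re.compile("[0-9]A")):
    el2 = el.previous_element.parent  # the element just before the span: the <a> tag
    print(el2["href"])  # address
    l.append(el)
print(l)  # [<span>18A</span>]
```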