Initial Release
In simple terms, web scraping, web harvesting, or web data extraction is an automated process of collecting large amounts of (usually unstructured) data from websites. The user can extract all the data from a particular site, or only the specific data they need. The collected data can then be stored in a structured format for further analysis.
Steps involved in web scraping:

1. Find the URL of the webpage you want to scrape.
2. Inspect the page and identify the elements that hold the data.
3. Write code to extract the data from those elements.
4. Run the code and store the extracted data in the desired format.

It's that simple!
What is Beautiful Soup?
Beautiful Soup is a pure Python library for extracting structured data from a website. It allows you to parse data from HTML and XML files. It acts as a helper module and lets you interact with HTML in a similar, and often more convenient, way than other available developer tools. It works with parsers such as lxml and html5lib to provide idiomatic Python ways of navigating, searching, and modifying the parse tree.

While in the iris-web-scraping folder, open a terminal and enter:
docker-compose up
The very first time, it may take a few minutes to build the image and install all the required Python modules.
Access the production by following this link: Access the Production
When you are done, stop the container with:

docker-compose down
The example URL here is:

url: "http://quotes.toscrape.com/"
The webpage we are going to scrape data from is a simple website built for web scraping practice. It is the simplest page to scrape, but if you are interested, you can try the others.
We are going to scrape the quotes and their authors.
We will be using two Python libraries: requests and bs4 (Beautiful Soup). These were automatically installed at startup.
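If you ever run the scraping code outside the container, note that the PyPI package providing bs4 is named beautifulsoup4; you would install both libraries with:

pip install requests beautifulsoup4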
If you go to “http://quotes.toscrape.com/” and inspect the page, you will be able to see the HTML elements and understand what to scrape.
As you can see, each div class="quote" contains a quote we want to scrape. Then, inside each of these divs, we have a span class="text" and a small class="author".
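For illustration, the markup for one quote looks roughly like this (an abridged sketch of the page's structure, not a verbatim copy):

<div class="quote">
    <span class="text">“The quote itself...”</span>
    <span>by <small class="author">Author Name</small></span>
</div>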
We now know what we want to gather and how to access them.
First, we need to request the HTML from the website and parse it into a bs4 object:
import requests
import bs4

req = requests.get(request.url)
soupdata = bs4.BeautifulSoup(req.text, features="html.parser")
Here is the code that needs to be changed to scrape another webpage. You can find it in the on_scrap_request function in src/python/bo.py.
We will be using BeautifulSoup's findAll functionality to look for all the tags of type div with the class quote:
divs = soupdata.findAll("div",{"class":"quote"})
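As an aside, findAll is the legacy camelCase name that bs4 keeps for backwards compatibility; the modern equivalent call would be:

divs = soupdata.find_all("div", class_="quote")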
Then, for each quote, we want to get the span of class text and the small of class author:
for i in range(len(divs)):
    text = divs[i].find("span", {"class": "text"}).text
    author = divs[i].find("small", {"class": "author"}).text
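If you want to sanity-check this logic outside of IRIS, here is a minimal standalone sketch using only the requests and bs4 libraries (the print call is just for illustration):

import requests
import bs4

url = "http://quotes.toscrape.com/"
req = requests.get(url)
soupdata = bs4.BeautifulSoup(req.text, features="html.parser")

# Gather every quote block, then pull out the text and author of each.
divs = soupdata.findAll("div", {"class": "quote"})
for div in divs:
    text = div.find("span", {"class": "text"}).text
    author = div.find("small", {"class": "author"}).text
    print(f"{author}: {text}")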
We then put all those results in our IRIS message and send them back to you, the user.
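For context, the operation might assemble and return the response along these lines. This is only a hedged sketch, not the repository's exact code: the ScrapResponse class and its results field are assumptions (only msg.ScrapRequest is confirmed by the test step below), as is the use of the grongier.pex interoperability package:

from dataclasses import dataclass

import bs4
import requests
from grongier.pex import BusinessOperation, Message

@dataclass
class ScrapResponse(Message):
    # Hypothetical response message; the class and field names are assumptions.
    results: list = None

class ScrapingOperation(BusinessOperation):
    def on_scrap_request(self, request):
        # request is the incoming msg.ScrapRequest carrying the target URL.
        req = requests.get(request.url)
        soupdata = bs4.BeautifulSoup(req.text, features="html.parser")
        results = []
        for div in soupdata.findAll("div", {"class": "quote"}):
            results.append({
                "text": div.find("span", {"class": "text"}).text,
                "author": div.find("small", {"class": "author"}).text,
            })
        # Returning the message sends it back to the caller.
        return ScrapResponse(results=results)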
You must access the Production following this link: http://localhost:52795/csp/irisapp/EnsPortal.ProductionConfig.zen?PRODUCTION=iris.Production

And connect using SuperUser as username and SYS as password.
To call the scraping, click on the Python.ScrapingOperation, select the Actions tab on the right, and you can Test the production.
In this test window, select:

Type of request: Grongier.PEX.Message

For the classname, you must enter:

msg.ScrapRequest

And for the JSON, you must enter the URL you want to scrape:
{
    "url": "http://quotes.toscrape.com/"
}
From here, press Invoke Testing Service and watch the visual trace. By going to the last message and clicking on contents, you will see the scraped data.
This is the simplest example of scraping; it can easily be used by anyone and is implemented on IRIS. This means that with just a few tweaks, you can connect this operation to a CRUD API or to an automated service that gathers data from the web and feeds it into the IRIS database.
This last link actually points to a Python training course on IRIS that shows how to use this module properly, and how to connect to the IRIS DB or an external PostgreSQL DB using a CRUD API.
See this post on the DC that inspired this GitHub repository.