Create Web Scraper Python


Recently I came across a tool that takes care of many of the issues you usually face while scraping websites. The tool is called Scraper API, and it provides an easy-to-use REST API for scraping different kinds of websites (simple, JavaScript-enabled, CAPTCHA-protected, etc.) with ease. Before I proceed further, allow me to introduce Scraper API.

Jan 05, 2021 In this article, we’re going to talk about how to perform web scraping with Python, using Selenium. Web scraping, also called web data extraction, refers to harvesting data from a web page by leveraging the patterns in the page’s underlying code. Splash is a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and Qt 5. Essentially, we are going to use Splash to render JavaScript-generated content. Run the Splash server: sudo docker run -p 8050:8050 scrapinghub/splash.
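
Once the Splash container from the Docker command above is running, rendering a page comes down to one HTTP call. The following is a minimal sketch, assuming Splash listens on localhost port 8050 and using its `render.html` endpoint with the `url` and `wait` parameters. The helper names `splash_render_url` and `fetch_rendered_html` are made up for illustration, and the standard library is used here instead of requests so the sketch has no extra dependencies.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Splash was started locally with the Docker command above.
SPLASH_URL = "http://localhost:8050"


def splash_render_url(page_url, wait=2.0):
    """Build the URL for Splash's render.html endpoint (hypothetical helper)."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{SPLASH_URL}/render.html?{query}"


def fetch_rendered_html(page_url, wait=2.0, timeout=60):
    """Ask Splash to load the page, run its JavaScript, and return the HTML."""
    with urlopen(splash_render_url(page_url, wait), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_rendered_html('https://example.com/')` only works while the Splash server is running; `splash_render_url` can be used on its own to inspect the request that would be sent.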

What is Scraper API

If you visit their website you’d find their mission statement:

Scraper API handles proxies, browsers, and CAPTCHAs, so you can get the HTML from any web page with a simple API call!

As the statement suggests, it offers everything you need to deal with the issues you usually come across while writing scrapers.

Development

Scraper API provides a REST API that can be consumed in any language. Since this post is about Python, I’ll mainly focus on using the requests library with this tool.

You must first sign up with them; in return, they will provide you with an API key to use their platform. They provide 1000 free API calls, which are enough to test the platform. Beyond that, they offer different plans, from starter to enterprise, which you can view here.

Let’s try a simple example, which is also given in the documentation.


</div><pre><code>import requests

API_KEY = '&lt;YOUR API KEY&gt;'
URL_TO_SCRAPE = 'https://httpbin.org/ip'

# Scraper API fetches URL_TO_SCRAPE on our behalf through one of its proxies
payload = {'api_key': API_KEY, 'url': URL_TO_SCRAPE}
r = requests.get('http://api.scraperapi.com', params=payload, timeout=60)
print(r.text)</code></pre><p>Assuming you have registered and have an API key, which you can find on the dashboard, you can start working right away. When you run this program, it shows the IP address your request came from.</p><p>Notice that every run returns a new IP address. Cool, isn’t it?</p><p>There are some scenarios where you would like to use the same proxy to give the impression that a single user is visiting different parts of the website. For that, you can pass the <code>session_number</code> parameter in the <code>payload</code> variable above.</p><div><pre><code>URL_TO_SCRAPE = 'https://httpbin.org/ip'

# The same session_number asks for the same proxy on every request
payload = {'api_key': API_KEY, 'url': URL_TO_SCRAPE, 'session_number': '123'}
r = requests.get('http://api.scraperapi.com', params=payload, timeout=60)
print(r.text)</code></pre>


With the session number set, repeated runs should return the same proxy IP each time.
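
The difference between a rotating and a sticky request is just one extra key in the payload. Here is a small sketch of that idea; `build_payload` and `build_request_url` are hypothetical helper names, not part of Scraper API or requests, and the code only builds the request without sending it.

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://api.scraperapi.com"


def build_payload(api_key, url, session_number=None):
    """Return the query parameters for one Scraper API call (hypothetical helper).

    Leaving session_number out lets the proxy rotate; passing the same value
    on every call asks for the same proxy.
    """
    payload = {"api_key": api_key, "url": url}
    if session_number is not None:
        payload["session_number"] = str(session_number)
    return payload


def build_request_url(payload):
    """Show the full URL that the payload would produce."""
    return f"{API_ENDPOINT}?{urlencode(payload)}"
```

With requests, the payload can be passed straight through, for example `requests.get(API_ENDPOINT, params=build_payload(API_KEY, URL_TO_SCRAPE, 123), timeout=60)`.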

Creating an OLX Scraper

As in my previous scraping posts, I am going to pick OLX again. I will first iterate over the listing page and then scrape the individual items. Below is the complete code.
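
The two-stage flow described above (collect item links from the listing, then read one field from each item page) can be tried offline before spending API calls. This is a rough sketch under stated assumptions: it uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages, the two sample HTML strings are invented, and only the base URL and the `data-aut-id` value `itemPrice` are taken from this post's code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.olx.com.pk"

# Invented sample markup, standing in for real listing and item pages.
SAMPLE_LISTING = (
    '<ul><li><a href="/item/tablet-1">Tablet 1</a></li>'
    '<li><a href="/item/tablet-2">Tablet 2</a></li></ul>'
)
SAMPLE_ITEM = '<div><span data-aut-id="itemPrice">Rs 25,000</span></div>'


class LinkCollector(HTMLParser):
    """Collect an absolute URL for every <a href=...> in a listing page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


class PriceExtractor(HTMLParser):
    """Capture the text of the first <span data-aut-id="itemPrice">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        is_price = tag == "span" and dict(attrs).get("data-aut-id") == "itemPrice"
        if is_price and self.price is None:
            self._inside = True

    def handle_data(self, data):
        if self._inside and data.strip():
            self.price = data.strip()
            self._inside = False

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False


def extract_links(listing_html, base_url=BASE_URL):
    parser = LinkCollector(base_url)
    parser.feed(listing_html)
    return parser.links


def extract_price(item_html):
    parser = PriceExtractor()
    parser.feed(item_html)
    return parser.price
```

In a real run, the HTML passed to `extract_links` and `extract_price` would be the `r.text` returned by the Scraper API calls, and the link collector would need a narrower filter than "every anchor on the page".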

</div><pre><code>import requests
from bs4 import BeautifulSoup
from time import sleep

API_KEY = '&lt;YOUR API KEY&gt;'
URL_TO_SCRAPE = '&lt;OLX LISTING URL&gt;'  # the listing URL is missing from this copy of the post
all_links = []

payload = {'api_key': API_KEY, 'url': URL_TO_SCRAPE, 'session_number': '123'}
r = requests.get('http://api.scraperapi.com', params=payload, timeout=60)
if r.status_code == 200:
    html = r.text
    soup = BeautifulSoup(html, 'lxml')
    for l in soup.select('a[href]'):  # placeholder selector; the original one is missing here
        all_links.append('https://www.olx.com.pk' + l['href'])

idx = 0
if len(all_links) &gt; 0:
    for link in all_links:
        sleep(5)  # pause between item requests
        payload = {'api_key': API_KEY, 'url': link, 'session_number': '123'}
        if idx &gt; 1:
            break  # stop after the first two items
        r = requests.get('http://api.scraperapi.com', params=payload, timeout=60)
        if r.status_code == 200:
            html = r.text
            soup = BeautifulSoup(html, 'lxml')
            price_section = soup.find('span', {'data-aut-id': 'itemPrice'})
            if price_section is not None:
                print(price_section.text)
        idx += 1
</code></pre><h2 id='creating-a-web-scraper-python'>Creating A Web Scraper Python</h2><p>I am using <code>BeautifulSoup</code> to parse the HTML. I have only extracted the price here because the purpose of this post is to explain the API itself rather than <em>BeautifulSoup</em>. You should see my post here in case you are new to scraping and Python.</p><h2>Conclusion</h2><p>In this post, you learned how to use Scraper API for scraping purposes. Whatever you can do with this API you can also do by other means, but it puts everything under one umbrella. That matters most for rendering pages via JavaScript, which otherwise requires headless browsers that can be cumbersome to set up on remote machines. Scraper API takes care of that for a nominal charge, for individuals and enterprises alike. The company I work with spends hundreds of dollars every month just on proxy IPs.</p><p>Oh, if you sign up here with my referral link or enter promo code <span><strong>adnan10</strong></span>, you will get a <span><strong>10% discount</strong></span>. In case you do not get the discount, just let me know via email on my site and I’ll be sure to help you out.</p><p>In the coming days, I’ll be writing more posts about Scraper API, discussing further features.</p><p><span><em><strong>Planning to write a book about Web Scraping in Python. Click here to give your feedback</strong></em></span></p></p>
					</div>
		
		<footer class="entry-meta">
		
           						</footer>
	</article>

				
<div id="comments" class="comments-area">

		
	
</div>
			
		</div>
	</div>


		
		</div>
	<footer id="colophon" role="contentinfo">
		<div class="site-info">
		<div class="footercopy">Loadimport352</div>
		<div class="footercredit"></div>
		</div>
		</footer>
		
				<div class="clear"></div>
</div>



</body>
</html>