
The Dirty State of Web Scraping

Charles · Oct 21, 2017 · 2 mins read

Relied on by most modern businesses in one form or another, web scraping and its cousin, web automation, are some of the most useful yet least talked about technologies driving businesses today.

While I’m discussing both web scraping and automation, I will be referring to both as simply “web scraping” for brevity. For clarity, web scraping is the art of data collection from one or more web pages or websites whereas web automation is the scripting of everyday tasks such as visiting a web page, filling out a form and clicking submit.
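To make the scraping half of that definition concrete, here is a minimal sketch that pulls every link out of a page using only Python’s standard library. The `LinkCollector` class, the `scrape_links` helper, and the sample job-listing HTML are all made up for illustration; a real scraper would first download the page and would usually lean on a framework rather than hand-rolled parsing.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (text, href) pairs for every <a href=...> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) tuples
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def scrape_links(html):
    """Return a list of (link text, href) pairs found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content; a real scraper would download this first.
SAMPLE_PAGE = """
<html><body>
  <h1>Job listings</h1>
  <a href="/jobs/1">Data Engineer</a>
  <a href="/jobs/2">Web Developer</a>
</body></html>
"""

if __name__ == "__main__":
    for text, href in scrape_links(SAMPLE_PAGE):
        print(text, href)
```

Web automation runs in the opposite direction: instead of reading data out of a page, a script drives the page by filling in fields and submitting forms.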

Perhaps it’s because of the unsavory groups that collect vast amounts of data to do evil things, like filling our inboxes with spam and interrupting dinner with telemarketing calls, but web scraping has gotten an undeserved bad name from the media even though it does a lot of good.

In this post, I’d like to focus on the positive and discuss some of the significant benefits that we’ve gotten from these technologies. For example, search engines, aggregated news, and product comparisons are just a few that many of us use on a daily basis.

Some of the less well-known but essential examples include finance sites like Mint, travel sites like Skyscanner, and job search websites like Indeed. In many cases, these services not only scrape and collect data but also log in to third-party sites and retrieve private or personal information on our behalf. It’s even how small startups often first “integrate” with large, established companies that prioritize blocking competition over improving the customer experience.

To take things a step further, web scraping is also helping us make significant progress in machine learning and data science through the collection of vast amounts of training, validation, and test data. Previously, only large companies with deep pockets could afford to collect data at that scale. Thankfully, those days are behind us, and now anyone with Internet access, the necessary programming skills, and a computer can access much of the same data.

If you are familiar with Python and want to see how you can put web scraping to work for you, take a look at an excellent open source framework called Scrapy. With impressive documentation, good examples, and a helpful community, it’s a great place to start.

For JavaScript-heavy websites, I recommend using Scrapy with Splash. It’s highly scalable and works with minimal resources. If you run into problems, you can find help via their Gitter chat and Stack Overflow. GitHub, with over 8,000 results for “scrapy,” is also a great place to look for additional examples and resources.

Last but not least, I recommend taking a look at this excellent article on the “Ethics of Web Scraping.” The author outlines what I believe is a good solid approach to doing things the “right” way.
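One widely followed convention for scraping the “right” way is to honor a site’s robots.txt file before fetching anything. I am not claiming this is the specific approach the linked article takes; it is simply a small, common example. The sketch below uses Python’s standard `urllib.robotparser` module. The robots.txt content, the `example-bot` user agent, and the `example.com` URLs are hypothetical, and the rules are parsed from a string so nothing touches the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally so no network call is needed.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical user agent name for our scraper.
AGENT = "example-bot"


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser we can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(rules, url, agent=AGENT):
    """Return True if the parsed rules permit this agent to fetch the URL."""
    return rules.can_fetch(agent, url)


rules = build_rules(SAMPLE_ROBOTS_TXT)

if __name__ == "__main__":
    for url in ("https://example.com/jobs/1", "https://example.com/private/report"):
        print(url, "allowed" if is_allowed(rules, url) else "blocked")
    # Seconds the site asks crawlers to wait between requests, if it says so.
    print("crawl delay:", rules.crawl_delay(AGENT))
```

Against a live site you would typically call `set_url()` and `read()` on the parser instead of `parse()`, then pause between requests for at least the delay the site asks for.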

In wrapping up, I say go forth and do good with the technology and share your experiences. It would be great to hear more people talking about the benefits and positive use cases.

As always, if you have questions, please leave them in the comments or reach out over Twitter.
